1. Walsh J, Dwumfour C, Cave J, Griffiths F. Spontaneously generated online patient experience data - how and why is it being used in health research: an umbrella scoping review. BMC Med Res Methodol 2022; 22:139. PMID: 35562661; PMCID: PMC9106384; DOI: 10.1186/s12874-022-01610-z.
Abstract
PURPOSE Social media has led to fundamental changes in the way that people look for and share health-related information. There is increasing interest in using this spontaneously generated online patient experience (SGOPE) data as a data source for health research. The aim was to summarise the state of the art regarding how and why SGOPE data has been used in health research. We determined the sites and platforms used as data sources, the purposes of the studies, the tools and methods being used, and any identified research gaps. METHODS A scoping umbrella review was conducted of review papers from 2015 to January 2021 that studied the use of SGOPE data for health research. Using keyword searches we identified 1759 papers, from which we included 58 relevant studies in our review. RESULTS Data were drawn from many individual general or health-specific platforms, although Twitter was the most widely used data source. The most frequent purposes were surveillance based: tracking infectious disease, identifying adverse events, and triaging mental health. Despite the developments in machine learning, the reviews included many small qualitative studies. Most NLP work used supervised methods for sentiment analysis and classification. Methods are still at an early stage of development, are often not fully explained, and differ by discipline, with some studies focusing on accuracy improvements and others on application. There is little evidence of any work that either compares the results of both approaches on the same data set or brings the ideas together. CONCLUSION Tools, methods, and techniques are still at an early stage of development, but strong consensus exists that this data source will become very important to patient-centred health research.
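The supervised sentiment classification that most of the included reviews report applying to patient posts can be sketched in a few lines. This is a minimal illustration with made-up posts and labels (none of them from the reviewed studies), assuming scikit-learn is available; real pipelines would train on thousands of annotated posts.

```python
# Toy supervised sentiment classifier over patient-experience posts:
# bag-of-words features feeding a Naive Bayes model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training posts with hand-assigned sentiment labels
posts = [
    "the new medication finally stopped my migraines, so relieved",
    "side effects were awful, could not sleep at all",
    "great support from my care team, feeling hopeful",
    "still in pain after the procedure, very disappointed",
]
sentiment = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(posts, sentiment)

# Classify an unseen post
prediction = model.predict(["feeling hopeful after starting the medication"])
```

The same pipeline shape (vectorizer plus classifier) underlies most of the supervised classification work the review describes; only the feature extractor and model family vary.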
Affiliation(s)
- Julia Walsh
- Warwick Medical School, University of Warwick, Coventry, UK
- Jonathan Cave
- Department of Economics, University of Warwick, Coventry, UK
- Frances Griffiths
- Warwick Medical School, University of Warwick, Coventry, UK; Centre for Health Policy, University of the Witwatersrand, Johannesburg, South Africa
2. Mitchell JR, Szepietowski P, Howard R, Reisman P, Jones JD, Lewis P, Fridley BL, Rollison DE. A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study. J Med Internet Res 2022; 24:e27210. PMID: 35319481; PMCID: PMC8987958; DOI: 10.2196/27210.
Abstract
Background Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more. Objective The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports. Methods We pursued three specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search each pathology report for the answers to two predefined questions: "What organ contains the tumor?" and "What is the kind of tumor or carcinoma?" This involved supervised training on 8197 pathology reports, each with ground truth answers to these two questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes and another to predict histology codes. Our final system is thus a network of three BERT-based models, which we call the CancerBERT network (caBERTnet). We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars. Results caBERTnet's accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.6% (1993/2042), respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training data set were 92.95% (1794/1930) and 96.01% (1853/1930), respectively. Conclusions We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
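The three-stage pipeline described above (answer two questions per report, then map the extracted phrases to standardized codes) can be sketched as a toy stand-in where extraction is pattern-based and code prediction is a dictionary lookup. This is only an illustration of the pipeline's shape, not the BERT models themselves; the lookup tables below are tiny illustrative subsets, not the study's mapping.

```python
# Stand-in for caBERTnet's pipeline shape: extract answers to the two
# predefined questions, then map phrases to standardized codes.

# Illustrative subsets of ICD-O-3 site and histology codes (not the study's tables)
SITE_CODES = {"breast": "C50.9", "lung": "C34.9", "prostate": "C61.9"}
HISTOLOGY_CODES = {"invasive ductal carcinoma": "8500/3", "adenocarcinoma": "8140/3"}

def extract_answers(report: str) -> dict:
    """Stand-in for the Q&A model: answer "What organ contains the tumor?"
    and "What is the kind of tumor or carcinoma?" with simple substring
    matching instead of a fine-tuned BERT layer."""
    text = report.lower()
    site = next((s for s in SITE_CODES if s in text), None)
    histology = next((h for h in HISTOLOGY_CODES if h in text), None)
    return {"site": site, "histology": histology}

def predict_codes(answers: dict) -> dict:
    """Stand-in for the two code-prediction models: map extracted phrases
    to standardized site and histology codes."""
    return {
        "site_code": SITE_CODES.get(answers["site"]),
        "histology_code": HISTOLOGY_CODES.get(answers["histology"]),
    }

report = "Specimen: left breast. Diagnosis: invasive ductal carcinoma, grade 2."
codes = predict_codes(extract_answers(report))
```

The value of the BERT-based version is precisely that it does not need such tables at extraction time: it generalizes across the diverse terminology that indicates the same pathology.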
Affiliation(s)
- Joseph Ross Mitchell
- Department of Machine Learning, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States; Department of Medicine, Faculty of Medicine & Dentistry, and the Alberta Machine Intelligence Institute, University of Alberta, Edmonton, AB, Canada; Alberta Health Services, Edmonton, AB, Canada
- Phillip Szepietowski
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Rachel Howard
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Phillip Reisman
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Jennie D Jones
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Patricia Lewis
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Brooke L Fridley
- Department of Biostatistics and Bioinformatics, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Dana E Rollison
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
3. Yan MY, Gustad LT, Nytrø Ø. Sepsis prediction, early detection, and identification using clinical text for machine learning: a systematic review. J Am Med Inform Assoc 2022; 29:559-575. PMID: 34897469; PMCID: PMC8800516; DOI: 10.1093/jamia/ocab236.
Abstract
OBJECTIVE To determine the effects of using unstructured clinical text in machine learning (ML) for prediction, early detection, and identification of sepsis. MATERIALS AND METHODS PubMed, Scopus, ACM DL, dblp, and IEEE Xplore databases were searched. Articles utilizing clinical text for ML or natural language processing (NLP) to detect, identify, recognize, diagnose, or predict the onset, development, progress, or prognosis of systemic inflammatory response syndrome, sepsis, severe sepsis, or septic shock were included. Sepsis definition, dataset, types of data, ML models, NLP techniques, and evaluation metrics were extracted. RESULTS The clinical text used in the models includes narrative notes written by nurses, physicians, and specialists in varying situations, often combined with common structured data such as demographics, vital signs, laboratory data, and medications. Area under the receiver operating characteristic curve (AUC) comparison of ML methods showed that utilizing both text and structured data predicts sepsis earlier and more accurately than structured data alone. No meta-analysis was performed because of incomparable measurements among the 9 included studies. DISCUSSION Studies focused on sepsis identification or early detection before onset; no studies used patient histories beyond the current episode of care to predict sepsis. Sepsis definition affects reporting methods, outcomes, and results. Many methods rely on continuous vital sign measurements in intensive care, so they are not easily transferable to general ward units. CONCLUSIONS Approaches were heterogeneous, but studies showed that utilizing both unstructured text and structured data in ML can improve identification and early detection of sepsis.
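The review's central finding, that text plus structured data outperforms structured data alone, rests on feeding both into one model. A minimal sketch of that combination follows, with made-up toy notes, vitals, and labels (not from any reviewed study) and assuming scikit-learn, SciPy, and NumPy are available; the reviewed systems use far richer features and models.

```python
# Combine TF-IDF features from clinical notes with structured vitals
# in a single feature matrix for one classifier.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical nursing/physician note snippets
notes = [
    "patient febrile, rigors, suspected infection source abdomen",
    "routine postoperative check, comfortable, no complaints",
    "hypotensive, tachycardic, lactate rising, possible sepsis",
    "ambulating well, afebrile, wound healing normally",
]
vitals = np.array([[38.9, 110], [36.8, 72], [35.5, 128], [36.6, 68]])  # temperature (C), heart rate
labels = [1, 0, 1, 0]  # 1 = developed sepsis (toy labels)

text_features = TfidfVectorizer().fit_transform(notes)
X = hstack([text_features, vitals])  # unstructured + structured side by side
clf = LogisticRegression().fit(X, labels)
preds = clf.predict(X)
```

In practice the structured columns would be scaled, and evaluation would use held-out encounters and AUC, as the reviewed studies do.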
Affiliation(s)
- Melissa Y Yan
- Department of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, Trondheim, Norway
- Lise Tuset Gustad
- Department of Circulation and Medical Imaging, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway; Department of Medicine, Levanger Hospital, Clinic of Medicine and Rehabilitation, Nord-Trøndelag Hospital Trust, Levanger, Norway
- Øystein Nytrø
- Department of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, Trondheim, Norway
4. Santiso S, Pérez A, Casillas A. Adverse Drug Reaction extraction: Tolerance to entity recognition errors and sub-domain variants. Comput Methods Programs Biomed 2021; 199:105891. PMID: 33333368; DOI: 10.1016/j.cmpb.2020.105891.
Affiliation(s)
- Sara Santiso, Alicia Pérez, Arantza Casillas
- IXA research group, University of the Basque Country (UPV/EHU), Manuel Lardizabal 1, 20080 Donostia, Spain
5. Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, Soni S, Wang Q, Wei Q, Xiang Y, Zhao B, Xu H. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc 2021; 27:457-470. PMID: 31794016; DOI: 10.1093/jamia/ocz200.
Abstract
OBJECTIVE This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning the methods, scope, and context of current research. MATERIALS AND METHODS We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers. RESULTS The number of publications on DL in clinical NLP more than doubled each year through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a "long tail" of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific. DISCUSSION Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference for recurrent neural networks in sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French-language clinical NLP with deep learning). CONCLUSION Deep learning has not yet fully penetrated clinical NLP, but its use is growing rapidly. This review highlighted both popular and unique trends in this active field.
Affiliation(s)
- Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
6. Fu JT, Sholle E, Krichevsky S, Scandura J, Campion TR. Extracting and classifying diagnosis dates from clinical notes: A case study. J Biomed Inform 2020; 110:103569. PMID: 32949781; DOI: 10.1016/j.jbi.2020.103569.
Abstract
Myeloproliferative neoplasms (MPNs) are chronic hematologic malignancies that may progress over long disease courses. The original date of diagnosis is an important piece of information for patient care and research, but is not consistently documented. We describe an attempt to build a pipeline that extracts dates from clinical notes with natural language processing (NLP) tools and techniques and classifies them as diagnosis dates or not. Inaccurate and incomplete date extraction and interpretation impacted the performance of the overall pipeline. Existing lightweight Python packages tended to have low specificity for identifying and interpreting partial and relative dates in clinical text. A rules-based regular expression (regex) approach achieved recall of 83.0% on dates manually annotated as diagnosis dates, and 77.4% on all annotated dates. With only 3.8% of annotated dates representing initial MPN diagnoses, additional methods of targeting candidate date instances may alleviate noise and class imbalance.
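The rules-based regex approach the authors fell back on can be sketched with the standard library alone. The patterns below are assumptions for illustration, not the paper's actual rules; note how a partial date like "March 2012" needs its own branch, which is exactly where the lightweight packages mentioned above fall short.

```python
# Regex-based extraction of full, partial, and bare-year dates from clinical text.
import re

DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}"  # numeric dates like 6/1/2015
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{4}"  # partial dates like March 2012
    r"|\d{4})\b"  # bare years
)

def extract_dates(note: str) -> list[str]:
    """Return every date-like string found in a note, in order of appearance."""
    return DATE_PATTERN.findall(note)

note = "Diagnosed with PV in March 2012; bone marrow biopsy on 6/1/2015 confirmed."
```

Extraction is only the first half of the paper's pipeline: each candidate date then still has to be classified as an initial diagnosis date or not, which is where the 3.8% class imbalance bites.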
Affiliation(s)
- Julia T Fu
- Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Division of Health Informatics, Memorial Sloan Kettering Cancer Center, 600 3rd Ave, 8th Fl, New York, NY 10016, United States
- Evan Sholle
- Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Information Technologies & Services, Weill Cornell Medicine, 575 Lexington Ave, 3rd Fl, New York, NY 10022, United States
- Spencer Krichevsky
- Joint Clinical Trials Office, Weill Cornell Medicine, 1300 York Ave, Box 305, New York, NY 10065, United States
- Joseph Scandura
- Department of Hematology and Oncology, Weill Cornell Medicine, 428 E 72nd St, Ste 300, New York, NY 10065, United States
- Thomas R Campion
- Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Information Technologies & Services, Weill Cornell Medicine, 575 Lexington Ave, 3rd Fl, New York, NY 10022, United States; Clinical and Translational Science Center, Weill Cornell Medicine, 1300 York Ave., Box 149, New York, NY 10065, United States; Department of Pediatrics, Weill Cornell Medicine, 525 E 68th St, Rm M610A, New York, NY 10065, United States
7. Idakwo G, Thangapandian S, Luttrell J, Zhou Z, Zhang C, Gong P. Deep Learning-Based Structure-Activity Relationship Modeling for Multi-Category Toxicity Classification: A Case Study of 10K Tox21 Chemicals With High-Throughput Cell-Based Androgen Receptor Bioassay Data. Front Physiol 2019; 10:1044. PMID: 31456700; PMCID: PMC6700714; DOI: 10.3389/fphys.2019.01044.
Abstract
Deep learning (DL) has attracted the attention of computational toxicologists as it offers a potentially greater power for in silico predictive toxicology than existing shallow learning algorithms. However, contradictory reports have been documented. To further explore the advantages of DL over shallow learning, we conducted this case study using two cell-based androgen receptor (AR) activity datasets with 10K chemicals generated from the Tox21 program. A nested double-loop cross-validation approach was adopted along with a stratified sampling strategy for partitioning chemicals of multiple AR activity classes (i.e., agonist, antagonist, inactive, and inconclusive) at the same distribution rates amongst the training, validation, and test subsets. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Further in-depth analyses of chemical scaffolding shed insights on structural alerts for AR agonists/antagonists and inactive/inconclusive compounds, which may aid in future drug discovery and improvement of toxicity prediction modeling.
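The stratified sampling strategy described above, keeping each activity class at the same rate across training, validation, and test subsets, can be sketched with the standard library. The 60/20/20 split fractions and the toy class counts below are assumptions for illustration, not the study's actual partition sizes.

```python
# Stratified three-way split: shuffle within each class, then cut each
# class at the same fractions, so class proportions match across subsets.
import random
from collections import defaultdict

def stratified_split(items, labels, fracs=(0.6, 0.2, 0.2), seed=0):
    """Partition items into train/validation/test, preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    splits = ([], [], [])
    for members in by_class.values():
        rng.shuffle(members)
        n = len(members)
        cut1 = int(n * fracs[0])
        cut2 = cut1 + int(n * fracs[1])
        splits[0].extend(members[:cut1])
        splits[1].extend(members[cut1:cut2])
        splits[2].extend(members[cut2:])
    return splits

# 100 toy "chemicals" mimicking the four AR activity classes
labels = ["agonist"] * 10 + ["antagonist"] * 10 + ["inactive"] * 60 + ["inconclusive"] * 20
items = list(range(len(labels)))
train, val, test = stratified_split(items, labels)
```

Without stratification, a rare class like agonists could land almost entirely in one subset, which would distort both training and the per-class metrics the study compares.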
Affiliation(s)
- Gabriel Idakwo
- School of Computing Sciences and Computer Engineering, The University of Southern Mississippi, Hattiesburg, MS, United States
- Sundar Thangapandian
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, United States
- Joseph Luttrell
- School of Computing Sciences and Computer Engineering, The University of Southern Mississippi, Hattiesburg, MS, United States
- Zhaoxian Zhou
- School of Computing Sciences and Computer Engineering, The University of Southern Mississippi, Hattiesburg, MS, United States
- Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, The University of Southern Mississippi, Hattiesburg, MS, United States
- Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, United States