1
Magoc T, Everson R, Harle CA. Enhancing an enterprise data warehouse for research with data extracted using natural language processing. J Clin Transl Sci 2023; 7:e149. [PMID: 37456264 PMCID: PMC10346024 DOI: 10.1017/cts.2023.575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2023] [Revised: 05/14/2023] [Accepted: 05/31/2023] [Indexed: 07/18/2023] Open
Abstract
Objective This study aims to develop a generalizable architecture for enhancing an enterprise data warehouse for research (EDW4R) with results from a natural language processing (NLP) model, which allows discrete data derived from clinical notes to be made broadly available for research use without need for NLP expertise. The study also quantifies the additional value that information extracted from clinical narratives brings to EDW4R. Materials and methods Clinical notes written during one month at an academic health center were used to evaluate the performance of an existing NLP model and to quantify its value added to the structured data. Manual review was utilized for performance analysis. The architecture for enhancing the EDW4R is described in detail to enable reproducibility. Results Two weeks were needed to enhance EDW4R with data from 250 million clinical notes. NLP generated 16% and 39% increases in data availability for two variables. Discussion Our architecture is highly generalizable to a new NLP model. The positive predictive value obtained by an independent team showed only slightly lower NLP performance than the values reported by the NLP developers. The NLP showed significant value added to data already available in structured format. Conclusion Given the value added by data extracted using NLP, it is important to enhance EDW4R with these data to enable research teams without NLP expertise to benefit from value added by NLP models.
Affiliation(s)
- Tanja Magoc
- College of Medicine, University of Florida, Gainesville, FL, USA
2
Bozkurt S, Magnani CJ, Seneviratne MG, Brooks JD, Hernandez-Boussard T. Expanding the Secondary Use of Prostate Cancer Real World Data: Automated Classifiers for Clinical and Pathological Stage. Front Digit Health 2022; 4:793316. [PMID: 35721793 PMCID: PMC9201076 DOI: 10.3389/fdgth.2022.793316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2021] [Accepted: 05/12/2022] [Indexed: 11/30/2022] Open
Abstract
Background Explicit documentation of stage is an endorsed quality metric by the National Quality Forum. Clinical and pathological cancer staging is inconsistently recorded within clinical narratives but can be derived from text in the Electronic Health Record (EHR). To address this need, we developed a Natural Language Processing (NLP) solution for extraction of clinical and pathological TNM stages from the clinical notes in prostate cancer patients. Methods Data for patients diagnosed with prostate cancer between 2010 and 2018 were collected from a tertiary care academic healthcare system's EHR records in the United States. This system is linked to the California Cancer Registry, and contains data on diagnosis, histology, cancer stage, treatment and outcomes. A randomly selected sample of patients were manually annotated for stage to establish the ground truth for training and validating the NLP methods. For each patient, a vector representation of clinical text (written in English) was used to train a machine learning model alongside a rule-based model and compared with the ground truth. Results A total of 5,461 prostate cancer patients were identified in the clinical data warehouse and over 30% were missing stage information. Thirty-three to thirty-six percent of patients were missing a clinical stage and the models accurately imputed the stage in 21–32% of cases. Twenty-one percent had a missing pathological stage and using NLP 71% of missing T stages and 56% of missing N stages were imputed. For both clinical and pathological T and N stages, the rule-based NLP approach out-performed the ML approach with a minimum F1 score of 0.71 and 0.40, respectively. For clinical M stage the ML approach out-performed the rule-based model with a minimum F1 score of 0.79 and 0.88, respectively. Conclusions We developed an NLP pipeline to successfully extract clinical and pathological staging information from clinical narratives. 
Our results can serve as a proof of concept for using NLP to augment clinical and pathological stage reporting in cancer registries and EHRs to enhance the secondary use of these data.
Affiliation(s)
- Selen Bozkurt
- Department of Medicine (Biomedical Informatics), Stanford University, Stanford, CA, United States
- Martin G. Seneviratne
- Department of Medicine (Biomedical Informatics), Stanford University, Stanford, CA, United States
- James D. Brooks
- School of Medicine, Stanford University, Stanford, CA, United States
- Tina Hernandez-Boussard
- Department of Medicine (Biomedical Informatics), Stanford University, Stanford, CA, United States
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, United States
- *Correspondence: Tina Hernandez-Boussard
3
Davila JR, Singh K, Hernandez-Boussard T, Wang S. Outcomes of Primary Trabeculectomy versus Combined Phacoemulsification-Trabeculectomy Using Automated Electronic Health Record Data Extraction. Curr Eye Res 2022; 47:923-929. [PMID: 35317681 PMCID: PMC10000312 DOI: 10.1080/02713683.2022.2045611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
PURPOSE Cataract is a known effect of trabeculectomy (TE), but some surgeons are hesitant to perform combined phacoemulsification-TE (PTE) due to a risk of increased TE failure. Herein, we compare intraocular pressure (IOP) lowering between trabeculectomy (TE) and phacoemulsification-TE (PTE) and investigate factors that impact patient outcomes. METHODS We performed a retrospective study of adults undergoing primary TE or PTE at our institution from 2010 to 2017. We used Kaplan-Meier survival analysis to investigate time to TE failure, and Cox proportional hazards modeling to investigate predictors of TE failure, defined as undergoing a second glaucoma surgery or using more IOP-lowering medications than pre-operatively. RESULTS 318 surgeries (218 TE; 100 PTE) from 268 patients were included. Median follow-up time was 753 days. Mean baseline IOP was 21.1 mmHg. There were no significant differences in IOP between TE and PTE groups beyond postoperative year 1, with 28.9-46.5% of TE and 35.5-44.4% of PTE groups achieving IOP ≤10. Final IOP was similar in both groups (p = 0.22): 12.41 (SD 4.18) mmHg in the TE group and 14.05 (SD 5.45) in the PTE group. 84 (26.4%) surgeries met failure criteria. After adjusting for surgery type, sex, age, race, surgeon, and glaucoma diagnosis there were no significant differences in TE failure. CONCLUSION This study suggests there is no significant difference in the risk of TE failure in patients receiving TE versus those receiving PTE.
Affiliation(s)
- Jose R Davila
- Byers Eye Institute, Department of Ophthalmology, Stanford University, Palo Alto, CA, USA
- Kuldev Singh
- Byers Eye Institute, Department of Ophthalmology, Stanford University, Palo Alto, CA, USA
- Tina Hernandez-Boussard
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Sophia Wang
- Byers Eye Institute, Department of Ophthalmology, Stanford University, Palo Alto, CA, USA
4
Lee KS, Shin DG, Hwang JH, Kim R, Han CH, Yoo J. Construction of a bone marrow report registry using a clinical data warehouse. Int J Lab Hematol 2021; 44:e140-e144. [PMID: 34889526 DOI: 10.1111/ijlh.13781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Accepted: 11/30/2021] [Indexed: 11/28/2022]
Affiliation(s)
- Kwang Seob Lee
- Department of Laboratory Medicine, Yonsei University College of Medicine, Seoul, Korea
- Dong-Gyo Shin
- Medical Record Service Team, National Health Insurance Service Ilsan Hospital, Goyang, Korea
- Jin-Hee Hwang
- Medical Record Service Team, National Health Insurance Service Ilsan Hospital, Goyang, Korea
- Ranhee Kim
- Medical Record Service Team, National Health Insurance Service Ilsan Hospital, Goyang, Korea
- Chang Hoon Han
- Division of Biomedical Informatics, Departments of Internal Medicine, National Health Insurance Service Ilsan Hospital, Goyang, Korea
- Jongha Yoo
- Division of Biomedical Informatics, Department of Laboratory Medicine, National Health Insurance Service Ilsan Hospital, Goyang, Korea
5
Eschrich SA, Teer JK, Reisman P, Siegel E, Challa C, Lewis P, Fellows K, Malpica E, Carvajal R, Gonzalez G, Cukras S, Betin-Montes M, Aden-Buie G, Avedon M, Manning D, Tan AC, Fridley BL, Gerke T, Van Looveren M, Blake A, Greenman J, Rollison D. Enabling Precision Medicine in Cancer Care Through a Molecular Data Warehouse: The Moffitt Experience. JCO Clin Cancer Inform 2021; 5:561-569. [PMID: 33989014 DOI: 10.1200/cci.20.00175] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/20/2022] Open
Abstract
PURPOSE The use of genomics within cancer research and clinical oncology practice has become commonplace. Efforts such as The Cancer Genome Atlas have characterized the cancer genome and suggested a wealth of targets for implementing precision medicine strategies for patients with cancer. The data produced from research studies and clinical care have many potential secondary uses beyond their originally intended purpose. Effective storage, query, retrieval, and visualization of these data are essential to create an infrastructure to enable new discoveries in cancer research. METHODS Moffitt Cancer Center implemented a molecular data warehouse to complement the extensive enterprise clinical data warehouse (Health and Research Informatics). Seven different sequencing experiment types were included in the warehouse, with data from institutional research studies and clinical sequencing. RESULTS The implementation of the molecular warehouse involved the close collaboration of many teams with different expertise and a use case-focused approach. Cornerstones of project success included project planning, open communication, institutional buy-in, piloting the implementation, implementing custom solutions to address specific problems, data quality improvement, and data governance, unique aspects of which are featured here. We describe our experience in selecting, configuring, and loading molecular data into the molecular data warehouse. Specifically, we developed solutions for heterogeneous genomic sequencing cohorts (many different platforms) and integration with our existing clinical data warehouse. CONCLUSION The implementation was ultimately successful despite challenges encountered, many of which can be generalized to other research cancer centers.
Affiliation(s)
- Steven A Eschrich
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL
- Jamie K Teer
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL
- Erin Siegel
- Total Cancer Care, Moffitt Cancer Center, Tampa, FL
- Patricia Lewis
- Data Quality and Business Intelligence, Moffitt Cancer Center, Tampa, FL
- Katherine Fellows
- Data Quality and Business Intelligence, Moffitt Cancer Center, Tampa, FL
- Rodrigo Carvajal
- Biostatistics and Bioinformatics Shared Resource, Moffitt Cancer Center, Tampa, FL
- Guillermo Gonzalez
- Biostatistics and Bioinformatics Shared Resource, Moffitt Cancer Center, Tampa, FL
- Scott Cukras
- Biostatistics and Bioinformatics Shared Resource, Moffitt Cancer Center, Tampa, FL
- Miguel Betin-Montes
- Biostatistics and Bioinformatics Shared Resource, Moffitt Cancer Center, Tampa, FL
- Melissa Avedon
- Basic, Population, and Quantitative Science Shared Resource Administration, Moffitt Cancer Center, Tampa, FL
- Daniel Manning
- Information Technology, Moffitt Cancer Center, Tampa, FL
- Aik Choon Tan
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL
- Brooke L Fridley
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL
- Travis Gerke
- Health Informatics, Moffitt Cancer Center, Tampa, FL
- Dana Rollison
- Department of Epidemiology, Moffitt Cancer Center, Tampa, FL
6
Evaluation of clustering and topic modeling methods over health-related tweets and emails. Artif Intell Med 2021; 117:102096. [PMID: 34127235 DOI: 10.1016/j.artmed.2021.102096] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/31/2020] [Revised: 03/30/2021] [Accepted: 05/05/2021] [Indexed: 01/31/2023]
Abstract
BACKGROUND The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources to shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches have suboptimal performance on these short texts. Moreover, research over health-related short texts using these methods has become difficult to reproduce and benchmark, partially due to the absence of a detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts. METHODS We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels). RESULTS Overall, for number of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, of all tweets (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%), for k = 2, most of the methods could respect this initial clustering distribution. However, we found that model performance varies with the source of data and hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets. CONCLUSIONS Researchers hoping to group or classify health-related short-text data can expect to select the most suitable topic modeling and clustering methods for their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggested researchers can improve their analysis of model performance by using a variety of metrics, since there is not a single best metric.
7
Coquet J, Bievre N, Billaut V, Seneviratne M, Magnani CJ, Bozkurt S, Brooks JD, Hernandez-Boussard T. Assessment of a Clinical Trial-Derived Survival Model in Patients With Metastatic Castration-Resistant Prostate Cancer. JAMA Netw Open 2021; 4:e2031730. [PMID: 33481032 PMCID: PMC7823224 DOI: 10.1001/jamanetworkopen.2020.31730] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022] Open
Abstract
IMPORTANCE Randomized clinical trials (RCTs) are considered the criterion standard for clinical evidence. Despite their many benefits, RCTs have limitations, such as costliness, that may reduce the generalizability of their findings among diverse populations and routine care settings. OBJECTIVE To assess the performance of an RCT-derived prognostic model that predicts survival among patients with metastatic castration-resistant prostate cancer (CRPC) when the model is applied to real-world data from electronic health records (EHRs). DESIGN, SETTING, AND PARTICIPANTS The RCT-trained model and patient data from the RCTs were obtained from the Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge for prostate cancer, which occurred from March 16 to July 27, 2015. This challenge included 4 phase 3 clinical trials of patients with metastatic CRPC. Real-world data were obtained from the EHRs of a tertiary care academic medical center that includes a comprehensive cancer center. In this study, the DREAM challenge RCT-trained model was applied to real-world data from January 1, 2008, to December 31, 2019; the model was then retrained using EHR data with optimized feature selection. Patients with metastatic CRPC were divided into RCT and EHR cohorts based on data source. Data were analyzed from March 23, 2018, to October 22, 2020. EXPOSURES Patients who received treatment for metastatic CRPC. MAIN OUTCOMES AND MEASURES The primary outcome was the performance of an RCT-derived prognostic model that predicts survival among patients with metastatic CRPC when the model is applied to real-world data. Model performance was compared using 10-fold cross-validation according to time-dependent integrated area under the curve (iAUC) statistics. RESULTS Among 2113 participants with metastatic CRPC, 1600 participants were included in the RCT cohort, and 513 participants were included in the EHR cohort. 
The RCT cohort comprised a larger proportion of White participants (1390 patients [86.9%] vs 337 patients [65.7%]) and a smaller proportion of Hispanic participants (14 patients [0.9%] vs 42 patients [8.2%]), Asian participants (41 patients [2.6%] vs 88 patients [17.2%]), and participants older than 75 years (388 patients [24.3%] vs 191 patients [37.2%]) compared with the EHR cohort. Participants in the RCT cohort also had fewer comorbidities (mean [SD], 1.6 [1.8] comorbidities vs 2.5 [2.6] comorbidities, respectively) compared with those in the EHR cohort. Of the 101 variables used in the RCT-derived model, 10 were not available in the EHR data set, 3 of which were among the top 10 features in the DREAM challenge RCT model. The best-performing EHR-trained model included only 25 of the 101 variables included in the RCT-trained model. The performance of the RCT-trained and EHR-trained models was adequate in the EHR cohort (mean [SD] iAUC, 0.722 [0.118] and 0.762 [0.106], respectively); model optimization was associated with improved performance of the best-performing EHR model (mean [SD] iAUC, 0.792 [0.097]). The EHR-trained model classified 256 patients as having a high risk of mortality and 256 patients as having a low risk of mortality (hazard ratio, 2.7; 95% CI, 2.0-3.7; log-rank P < .001). CONCLUSIONS AND RELEVANCE In this study, although the RCT-trained models did not perform well when applied to real-world EHR data, retraining the models using real-world EHR data and optimizing variable selection was beneficial for model performance. As clinical evidence evolves to include more real-world data, both industry and academia will likely search for ways to balance model optimization with generalizability. This study provides a pragmatic approach to applying RCT-trained models to real-world data.
Affiliation(s)
- Jean Coquet
- Department of Medicine, Stanford University School of Medicine, Stanford, California
- Nicolas Bievre
- Department of Statistics, Stanford University, Stanford, California
- Vincent Billaut
- Department of Statistics, Stanford University, Stanford, California
- Martin Seneviratne
- Department of Medicine, Stanford University School of Medicine, Stanford, California
- Department of Biomedical Data Science, Stanford University, Stanford, California
- Selen Bozkurt
- Department of Medicine, Stanford University School of Medicine, Stanford, California
- James D. Brooks
- Department of Urology, Stanford University School of Medicine, Stanford, California
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, California
- Tina Hernandez-Boussard
- Department of Medicine, Stanford University School of Medicine, Stanford, California
- Department of Biomedical Data Science, Stanford University, Stanford, California
- Department of Surgery, Stanford University School of Medicine, Stanford, California
8
Magnani CJ, Bievre N, Baker LC, Brooks JD, Blayney DW, Hernandez-Boussard T. Real-world Evidence to Estimate Prostate Cancer Costs for First-line Treatment or Active Surveillance. EUR UROL SUPPL 2020; 23:20-29. [PMID: 33367287 PMCID: PMC7751921 DOI: 10.1016/j.euros.2020.11.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/25/2022] Open
Abstract
Background Prostate cancer is the most common cancer in men and second leading cause of cancer-related deaths. Changes in screening guidelines, adoption of active surveillance (AS), and implementation of high-cost technologies have changed treatment costs. Traditional cost-effectiveness studies rely on clinical trial protocols unlikely to capture actual practice behavior, and existing studies use data predating new technologies. Real-world evidence reflecting these changes is lacking. Objective To assess real-world costs of first-line prostate cancer management. Design setting and participants We used clinical electronic health records for 2008-2018 linked with the California Cancer Registry and the Medicare Fee Schedule to assess costs over 24 or 60 mo following diagnosis. We identified surgery or radiation treatments with structured methods, while we used both structured data and natural language processing to identify AS. Outcome measurements and statistical analysis Our results are risk-stratified calculated cost per day (CCPD) for first-line management, which are independent of treatment duration. We used the Kruskal-Wallis test to compare unadjusted CCPD while analysis of covariance log-linear models adjusted estimates for age and Charlson comorbidity. Results and limitations In 3433 patients, surgery (54.6%) was more common than radiation (22.3%) or AS (23.0%). Two years following diagnosis, AS ($2.97/d) was cheaper than surgery ($5.67/d) or radiation ($9.34/d) in favorable disease, while surgery ($7.17/d) was cheaper than radiation ($16.34/d) for unfavorable disease. At 5 yr, AS ($2.71/d) remained slightly cheaper than surgery ($2.87/d) and radiation ($4.36/d) in favorable disease, while for unfavorable disease surgery ($4.15/d) remained cheaper than radiation ($10.32/d). Study limitations include information derived from a single healthcare system and costs based on benchmark Medicare estimates rather than actual payment exchanges. 
Patient summary Active surveillance was cheaper than surgery (-47.6%) and radiation (-68.2%) at 2 yr for favorable-risk disease, which decreased by 5 yr (-5.6% and -37.8%, respectively). Surgery was less costly than radiation for unfavorable risk for both intervals (-56.1% and -59.8%, respectively).
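The percentages in the patient summary can be reproduced from the cost-per-day figures reported in the results. The sketch below is only an illustration: it assumes the reported values are simple relative differences, (cost - reference) / reference, and the function and variable names are hypothetical rather than taken from the study.

```python
def percent_difference(cost: float, reference: float) -> float:
    """Relative difference of `cost` versus `reference`, in percent.

    Assumption: the abstract's percentages are plain relative differences.
    """
    return (cost - reference) / reference * 100.0


# Calculated cost per day (USD/day) as reported in the abstract.
favorable = {
    "2 yr": {"AS": 2.97, "surgery": 5.67, "radiation": 9.34},
    "5 yr": {"AS": 2.71, "surgery": 2.87, "radiation": 4.36},
}
unfavorable = {
    "2 yr": {"surgery": 7.17, "radiation": 16.34},
    "5 yr": {"surgery": 4.15, "radiation": 10.32},
}

for horizon, costs in favorable.items():
    vs_surgery = percent_difference(costs["AS"], costs["surgery"])
    vs_radiation = percent_difference(costs["AS"], costs["radiation"])
    print(f"Favorable, {horizon}: AS vs surgery {vs_surgery:.1f}%, "
          f"AS vs radiation {vs_radiation:.1f}%")

for horizon, costs in unfavorable.items():
    vs_radiation = percent_difference(costs["surgery"], costs["radiation"])
    print(f"Unfavorable, {horizon}: surgery vs radiation {vs_radiation:.1f}%")
```

Under that assumption the computed values round to the figures quoted in the patient summary (for example, -47.6% for active surveillance versus surgery at 2 yr).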
Affiliation(s)
- Nicolas Bievre
- Department of Statistics, Stanford University, Stanford, CA, USA
- Laurence C Baker
- Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA
- James D Brooks
- Department of Urology, Stanford University, Stanford, CA, USA
- Douglas W Blayney
- Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA
- Stanford Cancer Institute, School of Medicine, Stanford University, CA, USA
- Clinical Excellence Research Center, School of Medicine, Stanford University, CA, USA
9
Bozkurt S, Paul R, Coquet J, Sun R, Banerjee I, Brooks JD, Hernandez-Boussard T. Phenotyping severity of patient-centered outcomes using clinical notes: A prostate cancer use case. Learn Health Syst 2020; 4:e10237. [PMID: 33083539 PMCID: PMC7556418 DOI: 10.1002/lrh2.10237] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/15/2020] [Revised: 06/15/2020] [Accepted: 06/23/2020] [Indexed: 01/12/2023] Open
Abstract
Introduction A learning health system (LHS) must improve care in ways that are meaningful to patients, integrating patient‐centered outcomes (PCOs) into core infrastructure. PCOs are common following cancer treatment, such as urinary incontinence (UI) following prostatectomy. However, PCOs are not systematically recorded because they can only be described by the patient, are subjective and captured as unstructured text in the electronic health record (EHR). Therefore, PCOs pose significant challenges for phenotyping patients. Here, we present a natural language processing (NLP) approach for phenotyping patients with UI to classify their disease into severity subtypes, which can increase opportunities to provide precision‐based therapy and promote a value‐based delivery system. Methods Patients undergoing prostate cancer treatment from 2008 to 2018 were identified at an academic medical center. Using a hybrid NLP pipeline that combines rule‐based and deep learning methodologies, we classified positive UI cases as mild, moderate, and severe by mining clinical notes. Results The rule‐based model accurately classified UI into disease severity categories (accuracy: 0.86), which outperformed the deep learning model (accuracy: 0.73). In the deep learning model, the recall rates for mild and moderate group were higher than the precision rate (0.78 and 0.79, respectively). A hybrid model that combined both methods did not improve the accuracy of the rule‐based model but did outperform the deep learning model (accuracy: 0.75). Conclusion Phenotyping patients based on indication and severity of PCOs is essential to advance a patient centered LHS. EHRs contain valuable information on PCOs and by using NLP methods, it is feasible to accurately and efficiently phenotype PCO severity. Phenotyping must extend beyond the identification of disease to provide classification of disease severity that can be used to guide treatment and inform shared decision‐making. 
Our methods demonstrate a path to a patient centered LHS that could advance precision medicine.
Affiliation(s)
- Selen Bozkurt
- Department of Medicine, Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Rohan Paul
- Department of Biomedical Data Sciences, Stanford University, Stanford, California, USA
- Jean Coquet
- Department of Medicine, Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Ran Sun
- Department of Medicine, Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Imon Banerjee
- Department of Biomedical Data Sciences, Stanford University, Stanford, California, USA
- Department of Radiology, Stanford University, Stanford, California, USA
- James D Brooks
- Department of Urology, Stanford University, Stanford, California, USA
- Tina Hernandez-Boussard
- Department of Medicine, Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, California, USA
- Department of Surgery, Stanford University, Stanford, California, USA
10
Cho S, Sin M, Tsapepas D, Dale LA, Husain SA, Mohan S, Natarajan K. Content Coverage Evaluation of the OMOP Vocabulary on the Transplant Domain Focusing on Concepts Relevant for Kidney Transplant Outcomes Analysis. Appl Clin Inform 2020; 11:650-658. [PMID: 33027834 DOI: 10.1055/s-0040-1716528] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Improving outcomes of transplant recipients within and across transplant centers is important with the increasing number of organ transplantations being performed. The current practice is to analyze the outcomes based on patient level data submitted to the United Network for Organ Sharing (UNOS). Augmenting the UNOS data with other sources such as the electronic health record will enrich the outcomes analysis, for which a common data model (CDM) can be a helpful tool for transforming heterogeneous source data into a uniform format. OBJECTIVES In this study, we evaluated the feasibility of representing concepts from the UNOS transplant registry forms with the Observational Medical Outcomes Partnership (OMOP) CDM vocabulary to understand the content coverage of OMOP vocabulary on transplant-specific concepts. METHODS Two annotators manually mapped a total of 3,571 unique concepts extracted from the UNOS registry forms to concepts in the OMOP vocabulary. Concept mappings were evaluated by (1) examining the agreement among the initial two annotators and (2) investigating the number of UNOS concepts not mapped to a concept in the OMOP vocabulary and then classifying them. A subset of mappings was validated by clinicians. RESULTS There was a substantial agreement between annotators with a kappa score of 0.71. We found that 55.5% of UNOS concepts could not be represented with OMOP standard concepts. The majority of unmapped UNOS concepts were categorized into transplant, measurement, condition, and procedure concepts. CONCLUSION We identified categories of unmapped concepts and found that some transplant-specific concepts do not exist in the OMOP vocabulary. We suggest that adding these missing concepts to OMOP would facilitate further research in the transplant domain.
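The abstract reports inter-annotator agreement as a kappa score without naming the variant. As a minimal sketch, assuming the two-annotator Cohen's kappa, the statistic compares observed agreement with the agreement expected by chance. The labels below are toy data invented for illustration, not the study's mappings, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): "M" = mapped to an OMOP concept, "U" = unmapped.
annotator_a = ["M", "M", "M", "M", "M", "U", "U", "U", "U", "U"]
annotator_b = ["M", "M", "M", "M", "U", "U", "U", "U", "U", "M"]

print(round(cohens_kappa(annotator_a, annotator_b), 2))  # prints 0.6
```

In this toy case the annotators agree on 8 of 10 items while chance agreement is 0.5, giving a kappa of 0.6. The study's reported 0.71 falls in the 0.61 to 0.80 band that is conventionally labelled substantial agreement.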
Affiliation(s)
- Sylvia Cho
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Margaret Sin
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Demetra Tsapepas
- Department of Surgery, Columbia University, New York, New York, United States
- Department of Transplantation, New York Presbyterian Hospital, New York, New York, United States
- Leigh-Anne Dale
- Department of Medicine, Columbia University Medical Center, New York, New York, United States
- Syed A Husain
- Division of Nephrology, Department of Medicine, Columbia University Medical Center, New York, New York, United States
- Sumit Mohan
- Division of Nephrology, Department of Medicine, Columbia University Medical Center, New York, New York, United States
- Department of Epidemiology, Columbia University Mailman School of Public Health, New York, New York, United States
- Karthik Natarajan
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
11
Wang SY, Azad AD, Lin SC, Hernandez-Boussard T, Pershing S. Intraocular Pressure Changes after Cataract Surgery in Patients with and without Glaucoma: An Informatics-Based Approach. Ophthalmol Glaucoma 2020; 3:343-349. [PMID: 32703703 DOI: 10.1016/j.ogla.2020.06.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/23/2020] [Revised: 05/28/2020] [Accepted: 06/03/2020] [Indexed: 10/24/2022]
Abstract
PURPOSE To evaluate changes in intraocular pressure (IOP) after cataract surgery among patients with or without glaucoma using automated extraction of data from electronic health records (EHRs). DESIGN Retrospective cohort study. PARTICIPANTS Adults who underwent standalone cataract surgery at a single academic center from 2009-2018. METHODS Patient information was identified from procedure and billing codes, demographic tables, medication orders, clinical notes, and eye examination fields in the EHR. A previously validated natural language processing pipeline was used to identify laterality of cataract surgery from operative notes and laterality of eye medications from medication orders. Cox proportional hazards modeling evaluated factors associated with the main outcome of sustained postoperative IOP reduction. MAIN OUTCOME MEASURES Sustained post-cataract surgery IOP reduction, measured at 14 months or the last follow-up while using equal or fewer glaucoma medications compared with baseline and without additional glaucoma laser or surgery on the operative eye. RESULTS The median follow-up for 7574 eyes of 4883 patients who underwent cataract surgery was 244 days. The mean preoperative IOP for all patients was 15.2 mmHg (standard deviation [SD], 3.4 mmHg), which decreased to 14.2 mmHg (SD, 3.0 mmHg) at 12 months after surgery. Patients with IOP of 21.0 mmHg or more showed mean postoperative IOP reduction ranging from -6.2 to -6.9 mmHg. Cataract surgery was more likely to yield sustained IOP reduction for patients with primary open-angle glaucoma (hazard ratio [HR], 1.19; 95% confidence interval, 1.05-1.36) or narrow angles or angle closure (HR, 1.21; 95% confidence interval, 1.08-1.34) compared with patients without glaucoma. Those with a higher baseline IOP were more likely to achieve postoperative IOP reduction (HR, 1.06 per 1-mmHg increase in baseline IOP; 95% confidence interval, 1.05-1.07). 
CONCLUSIONS Our results suggest that patients with primary open-angle glaucoma or with narrow angles or chronic angle closure were more likely to achieve sustained IOP reduction after cataract surgery. Patients with higher baseline IOP had increasingly higher odds of achieving reduction in IOP. This evidence demonstrates the potential usefulness of a pipeline for automated extraction of ophthalmic surgical outcomes from EHR to answer key clinical questions on a large scale.
Affiliation(s)
- Sophia Y Wang
  - Byers Eye Institute, Stanford University, Palo Alto, California
- Amee D Azad
  - Stanford University School of Medicine, Stanford, California
- Shan C Lin
  - Glaucoma Center of San Francisco, San Francisco, California
- Suzann Pershing
  - Byers Eye Institute, Stanford University, Palo Alto, California; Veterans Affairs Palo Alto Health Care System, Palo Alto, California
12
Meregaglia M, Ciani O, Banks H, Salcher-Konrad M, Carney C, Jayawardana S, Williamson P, Fattore G. A scoping review of core outcome sets and their 'mapping' onto real-world data using prostate cancer as a case study. BMC Med Res Methodol 2020; 20:41. [PMID: 32103725 PMCID: PMC7045588 DOI: 10.1186/s12874-020-00928-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/17/2019] [Accepted: 02/17/2020] [Indexed: 12/14/2022] Open
Abstract
Background A Core Outcomes Set (COS) is an agreed minimum set of outcomes that should be reported in all clinical studies related to a specific condition. Using prostate cancer as a case study, we identified, summarized, and critically appraised published COS development studies and assessed the degree of overlap between them and selected real-world data (RWD) sources. Methods We conducted a scoping review of the Core Outcome Measures in Effectiveness Trials (COMET) Initiative database to identify all COS studies developed for prostate cancer. Several characteristics (i.e., study type, methods for consensus, type of participants, outcomes included in COS and corresponding measurement instruments, timing, and sources) were extracted from the studies; outcomes were classified according to a predefined 38-item taxonomy. The study methodology was assessed based on the recent COS-STAndards for Development (COS-STAD) recommendations. A ‘mapping’ exercise was conducted between the COS identified and RWD routinely collected in selected European countries. Results Eleven COS development studies published between 1995 and 2017 were retrieved, of which 8 were classified as ‘COS for clinical trials and clinical research’, 2 as ‘COS for practice’ and 1 as ‘COS patient reported outcomes’. Recommended outcomes were mainly categorized into ‘mortality and survival’ (17%), ‘outcomes related to neoplasm’ (18%), and ‘renal and urinary outcomes’ (13%) with no relevant differences among COS study types. The studies generally fulfilled the criteria for the COS-STAD ‘scope specification’ domain but not the ‘stakeholders involved’ and ‘consensus process’ domains. About 72% overlap existed between COS and linked administrative data sources, with important gaps. Linking with patient registries improved coverage (85%), but was sometimes limited to smaller follow-up patient groups. 
Conclusions This scoping review identified few COS development studies in prostate cancer, some quite dated and with a growing level of methodological quality over time. This study revealed promising overlap between COS and RWD sources, though with important limitations; linking established, national patient registries to administrative data provides the best means to additionally capture patient-reported and some clinical outcomes over time. Thus, increasing the combination of different data sources and the interoperability of systems to follow larger patient groups in RWD is required.
Affiliation(s)
- Oriana Ciani
  - CERGAS, SDA Bocconi, Milan, Italy; Institute of Health Research, University of Exeter Medical School, Exeter, UK
- Paula Williamson
  - MRC North West Hub for Trials Methodology Research, Department of Biostatistics, University of Liverpool, Liverpool, UK
- Giovanni Fattore
  - CERGAS, SDA Bocconi, Milan, Italy; Department of Social and Political Sciences, Bocconi University, Milan, Italy
13
Hernandez-Boussard T, Blayney DW, Brooks JD. Leveraging Digital Data to Inform and Improve Quality Cancer Care. Cancer Epidemiol Biomarkers Prev 2020; 29:816-822. [PMID: 32066619 DOI: 10.1158/1055-9965.epi-19-0873] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/05/2019] [Revised: 10/03/2019] [Accepted: 02/12/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Efficient capture of routine clinical care and patient outcomes is needed at a population-level, as is evidence on important treatment-related side effects and their effect on well-being and clinical outcomes. The increasing availability of electronic health records (EHR) offers new opportunities to generate population-level patient-centered evidence on oncologic care that can better guide treatment decisions and patient-valued care. METHODS This study includes patients seeking care at an academic medical center, 2008 to 2018. Digital data sources are combined to address missingness, inaccuracy, and noise common to EHR data. Clinical concepts were identified and extracted from EHR unstructured data using natural language processing (NLP) and machine/deep learning techniques. All models are trained, tested, and validated on independent data samples using standard metrics. RESULTS We provide use cases for using EHR data to assess guideline adherence and quality measurements among patients with cancer. Pretreatment assessment was evaluated by guideline adherence and quality metrics for cancer staging metrics. Our studies in perioperative quality focused on medications administered and guideline adherence. Patient outcomes included treatment-related side effects and patient-reported outcomes. CONCLUSIONS Advanced technologies applied to EHRs present opportunities to advance population-level quality assessment, to learn from routinely collected clinical data for personalized treatment guidelines, and to augment epidemiologic and population health studies. The effective use of digital data can inform patient-valued care, quality initiatives, and policy guidelines. IMPACT A comprehensive set of health data analyzed with advanced technologies results in a unique resource that facilitates wide-ranging, innovative, and impactful research on prostate cancer. 
This work demonstrates new ways to use EHRs and technology to advance epidemiologic studies and benefit oncologic care. See all articles in this CEBP Focus section, "Modernizing Population Science."
Affiliation(s)
- Tina Hernandez-Boussard
  - Department of Medicine, Stanford University, Stanford, California; Department of Biomedical Data Science, Stanford University, Stanford, California; Department of Surgery, Stanford University School of Medicine, Stanford, California
- Douglas W Blayney
  - Department of Medicine, Stanford University, Stanford, California; Stanford Cancer Institute, Stanford University School of Medicine, Stanford, California
- James D Brooks
  - Stanford Cancer Institute, Stanford University School of Medicine, Stanford, California; Department of Urology, Stanford University School of Medicine, Stanford, California
14
Li K, Banerjee I, Magnani CJ, Blayney DW, Brooks JD, Hernandez-Boussard T. Clinical Documentation to Predict Factors Associated with Urinary Incontinence Following Prostatectomy for Prostate Cancer. Res Rep Urol 2020; 12:7-14. [PMID: 32158720 PMCID: PMC6986242 DOI: 10.2147/rru.s234178] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/10/2019] [Accepted: 12/11/2019] [Indexed: 02/01/2023] Open
Abstract
Background Advances in data collection provide opportunities to use population samples in identifying risk factors for urinary incontinence (UI), which occurs in up to 71% of men with prostate cancer following prostatectomy. Most studies on patient-centered outcomes use surveys or manual chart abstraction for data collection, which can be costly and difficult to scale. We sought to evaluate rates of and risk factors for UI following prostatectomy using natural language processing on electronic health record (EHR) data. Methods We conducted a retrospective analysis of patients undergoing prostatectomy for prostate cancer between January 2008 and August 2018 using EHR data from an academic medical center. UI incidence for each patient in the cohort was assessed using natural language processing from clinical notes generated pre- and postoperatively. Multivariable logistic regression was used to evaluate potential risk factors for postoperative UI at various time points within 2 years following surgery. Results We identified 3792 patients who underwent prostatectomy for prostate cancer. We found a significant association between preoperative UI and UI in the first (odds ratio [OR], 2.30; 95% confidence interval [CI], 1.24–4.28) and second (OR 2.24, 95% CI 1.04–4.83) years following surgery. Preoperative body mass index was also associated with UI in the second postoperative year (OR 1.11, 95% CI 1.02–1.21). Conclusion We show that a natural language processing approach using clinical narratives can be used to assess risk for UI in prostate cancer patients. Unstructured clinical narrative text can help advance future population-level research in patient-centered outcomes and quality of care.
Affiliation(s)
- Kevin Li
  - Stanford University School of Medicine, Stanford, CA, USA
- Imon Banerjee
  - Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
- Douglas W Blayney
  - Department of Medicine (Oncology), Stanford University School of Medicine, Stanford, CA, USA
- James D Brooks
  - Department of Urology (Urologic Oncology), Stanford University School of Medicine, Stanford, CA, USA
- Tina Hernandez-Boussard
  - Department of Medicine (Biomedical Informatics), Biomedical Data Sciences, and Surgery, Stanford University School of Medicine, Stanford, CA, USA
15
Lenain R, Seneviratne MG, Bozkurt S, Blayney DW, Brooks JD, Hernandez-Boussard T. Machine Learning Approaches for Extracting Stage from Pathology Reports in Prostate Cancer. Stud Health Technol Inform 2019; 264:1522-1523. [PMID: 31438212 DOI: 10.3233/shti190515] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/26/2022]
Abstract
Clinical and pathological stage are defining parameters in oncology, which direct a patient's treatment options and prognosis. Pathology reports contain a wealth of staging information that is not stored in structured form in most electronic health records (EHRs). Therefore, we evaluated three supervised machine learning methods (Support Vector Machine, Decision Trees, Gradient Boosting) to classify free-text pathology reports for prostate cancer into T, N and M stage groups.
Affiliation(s)
- Raphael Lenain
  - Department of Medicine, Biomedical Informatics, Stanford University, Stanford, CA, USA
- Martin G Seneviratne
  - Department of Medicine, Biomedical Informatics, Stanford University, Stanford, CA, USA
- Selen Bozkurt
  - Department of Medicine, Biomedical Informatics, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Douglas W Blayney
  - Department of Medicine, Division of Medical Oncology, Stanford University, Stanford, CA, USA
- James D Brooks
  - Department of Urology, Stanford University, Stanford, CA, USA
- Tina Hernandez-Boussard
  - Department of Medicine, Biomedical Informatics, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
16
Extracting Patient-Centered Outcomes from Clinical Notes in Electronic Health Records: Assessment of Urinary Incontinence After Radical Prostatectomy. EGEMS 2019; 7:43. [PMID: 31497615 PMCID: PMC6706996 DOI: 10.5334/egems.297] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023]
Abstract
Objective: To assess documentation of urinary incontinence (UI) in prostatectomy patients using unstructured clinical notes from Electronic Health Records (EHRs). Methods: We developed a weakly-supervised natural language processing tool to extract assessments, as recorded in unstructured text notes, of UI before and after radical prostatectomy in a single academic practice across multiple clinicians. Validation was carried out using a subset of patients who completed EPIC-26 surveys before and after surgery. The prevalence of UI as assessed by EHR and EPIC-26 was compared using repeated-measures ANOVA. The agreement of reported UI between EHR and EPIC-26 was evaluated using Cohen’s Kappa coefficient. Results: A total of 4870 patients and 716 surveys were included. Preoperative prevalence of UI was 12.7 percent. Postoperative prevalence was 71.8 percent at 3 months, 50.2 percent at 6 months, and 34.4 percent and 41.8 percent at 12 and 24 months, respectively. Similar rates were recorded by physicians in the EHR, particularly for early follow-up. For all time points, the agreement between EPIC-26 and the EHR was moderate (all p < 0.001) and ranged from 86.7 percent agreement at baseline (Kappa = 0.48) to 76.4 percent agreement at 24 months postoperative (Kappa = 0.047). Conclusions: We have developed a tool to assess documentation of UI after prostatectomy using EHR clinical notes. Our results suggest such a tool can facilitate unbiased measurement of important PCOs using real-world data, which are routinely recorded in EHR unstructured clinician notes. Integrating PCO information into clinical decision support can help guide shared treatment decisions and promote patient-valued care.
17
Bozkurt S, Kan KM, Ferrari MK, Rubin DL, Blayney DW, Hernandez-Boussard T, Brooks JD. Is it possible to automatically assess pretreatment digital rectal examination documentation using natural language processing? A single-centre retrospective study. BMJ Open 2019; 9:e027182. [PMID: 31324681 PMCID: PMC6661600 DOI: 10.1136/bmjopen-2018-027182] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/30/2023] Open
Abstract
OBJECTIVES To develop and test a method for automatic assessment of a quality metric, provider-documented pretreatment digital rectal examination (DRE), using the outputs of a natural language processing (NLP) framework. SETTING An electronic health records (EHR)-based prostate cancer data warehouse was used to identify patients and associated clinical notes from 1 January 2005 to 31 December 2017. Using a previously developed natural language processing pipeline, we classified DRE assessment as documented (currently or historically performed), deferred (or suggested as a future examination) and refused. PRIMARY AND SECONDARY OUTCOME MEASURES We investigated performance on the quality metric (DRE documentation within 6 months before treatment) and identified patient and clinical factors associated with metric performance. RESULTS The cohort included 7215 patients with prostate cancer and 426,227 unique clinical notes associated with pretreatment encounters. DREs of 5958 (82.6%) patients were documented and 1257 (17.4%) patients did not have a DRE documented in the EHR. A total of 3742 (51.9%) patient DREs were documented within 6 months prior to treatment, meeting the quality metric. Patients with private insurance had a higher rate of DRE 6 months prior to starting treatment as compared with Medicaid-based or Medicare-based payors (77.3% vs 69.5%, p=0.001). Patients undergoing chemotherapy, radiation therapy or surgery as the first line of treatment were more likely to have a documented DRE 6 months prior to treatment. CONCLUSION EHRs contain valuable unstructured information and with NLP, it is feasible to accurately and efficiently identify quality metrics within the current clinician documentation workflow.
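As an illustration of how the quality metric described above might be computed from NLP output, the following is a minimal sketch. It assumes each note has already been labelled as documented, deferred or refused; the function names, the patient records, and the 183-day approximation of "6 months" are all assumptions made for this example and are not taken from the study.

```python
from datetime import date, timedelta

LOOKBACK = timedelta(days=183)  # assumption: "6 months" approximated as 183 days

def meets_dre_metric(treatment_date, dre_notes):
    """True if any note labelled 'documented' falls in the lookback window."""
    window_start = treatment_date - LOOKBACK
    return any(
        label == "documented" and window_start <= note_date <= treatment_date
        for note_date, label in dre_notes
    )

def metric_rate(patients):
    """Share of patients meeting the metric; maps id -> (treatment_date, notes)."""
    met = sum(meets_dre_metric(t, notes) for t, notes in patients.values())
    return met / len(patients)

# Hypothetical patients; labels mirror the abstract's three classes.
patients = {
    "p1": (date(2016, 6, 1), [(date(2016, 3, 15), "documented")]),
    "p2": (date(2016, 6, 1), [(date(2015, 1, 10), "documented")]),  # too early
    "p3": (date(2016, 6, 1), [(date(2016, 5, 20), "deferred")]),    # not performed
}
rate = metric_rate(patients)
```

In this toy cohort only the first patient meets the metric, so the rate is one in three; the abstract's figure of 51.9% corresponds to 3742 of 7215 patients.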
Affiliation(s)
- Selen Bozkurt
  - Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Medicine (Biomedical Informatics), Stanford University, Stanford, CA, USA
- Kathleen M Kan
  - Urology, Stanford Lucile Salter Packard Children's Hospital, Stanford, CA, USA
- Daniel L Rubin
  - Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Radiology, Stanford University, Stanford, CA, USA
- Tina Hernandez-Boussard
  - Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Medicine (Biomedical Informatics), Stanford University, Stanford, CA, USA
  - Surgery, Stanford University, Stanford, CA, USA
18
Magnani CJ, Li K, Seto T, McDonald KM, Blayney DW, Brooks JD, Hernandez-Boussard T. PSA Testing Use and Prostate Cancer Diagnostic Stage After the 2012 U.S. Preventive Services Task Force Guideline Changes. J Natl Compr Canc Netw 2019; 17:795-803. [PMID: 31319390 PMCID: PMC7195904 DOI: 10.6004/jnccn.2018.7274] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 09/15/2018] [Accepted: 01/15/2019] [Indexed: 12/28/2022]
Abstract
BACKGROUND Most patients with prostate cancer are diagnosed with low-grade, localized disease and may not require definitive treatment. In 2012, the U.S. Preventive Services Task Force (USPSTF) recommended against prostate cancer screening to address overdetection and overtreatment. This study sought to determine the effect of guideline changes on prostate-specific antigen (PSA) screening and initial diagnostic stage for prostate cancer. PATIENTS AND METHODS A difference-in-differences analysis was conducted to compare changes in PSA screening (exposure) relative to cholesterol testing (control) after the 2012 USPSTF guideline changes, and chi-square test was used to determine whether there was a subsequent decrease in early-stage, low-risk prostate cancer diagnoses. Data were derived from a tertiary academic medical center's electronic health records, a national commercial insurance database (OptumLabs), and the SEER database for men aged ≥35 years before (2008-2011) and after (2013-2016) the guideline changes. RESULTS In both the academic center and insurance databases, PSA testing significantly decreased for all men compared with the control. The greatest decrease was among men aged 55 to 74 years at the academic center and among those aged ≥75 years in the commercial database. The proportion of early-stage prostate cancer diagnoses ( CONCLUSIONS In primary care, PSA testing decreased significantly and fewer prostate cancers were diagnosed at an early stage, suggesting provider adherence to the 2012 USPSTF guideline changes. Long-term follow-up is needed to understand the effect of decreased screening on prostate cancer survival.
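The difference-in-differences design named in the methods above can be sketched in its simplest two-group, two-period form. The rates below are hypothetical and are not results from the study, and the authors most likely fitted a regression model rather than this direct subtraction.

```python
def did_estimate(exposed_pre, exposed_post, control_pre, control_post):
    """Difference-in-differences on four group-level rates."""
    return (exposed_post - exposed_pre) - (control_post - control_pre)

# Hypothetical testing rates per 100 eligible men; not taken from the study.
psa_pre, psa_post = 30.0, 22.0      # PSA testing (exposure)
chol_pre, chol_post = 40.0, 39.0    # cholesterol testing (control)
effect = did_estimate(psa_pre, psa_post, chol_pre, chol_post)
```

Subtracting the change in the control test removes trends that affect all laboratory testing, so the remaining difference is attributed to the guideline change under the usual parallel-trends assumption.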
Affiliation(s)
- Kevin Li
  - School of Medicine, Stanford University
- Tina Seto
  - Stanford University School of Medicine IRT Research Technology
- Douglas W. Blayney
  - Department of Medicine, Stanford University
  - Stanford Cancer Institute, Stanford University
- Tina Hernandez-Boussard
  - Department of Medicine, Stanford University
  - Department of Biomedical Data Science, Stanford University
19
Coquet J, Bozkurt S, Kan KM, Ferrari MK, Blayney DW, Brooks JD, Hernandez-Boussard T. Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients. J Biomed Inform 2019; 94:103184. [PMID: 31014980 PMCID: PMC6584041 DOI: 10.1016/j.jbi.2019.103184] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/27/2018] [Revised: 04/15/2019] [Accepted: 04/19/2019] [Indexed: 01/31/2023]
Abstract
OBJECTIVE Clinical care guidelines recommend that newly diagnosed prostate cancer patients at high risk for metastatic spread receive a bone scan prior to treatment and that low-risk patients not receive it. The objective was to develop an automated pipeline to interrogate heterogeneous data to evaluate the use of bone scans using two different Natural Language Processing (NLP) approaches. MATERIALS AND METHODS Our cohort was divided into risk groups based on Electronic Health Records (EHR). Information on bone scan utilization was identified in both structured data and free text from clinical notes. Our pipeline annotated sentences with a combination of a rule-based method using the ConText algorithm (a generalization of NegEx) and a Convolutional Neural Network (CNN) method using word2vec to produce word embeddings. RESULTS A total of 5500 patients and 369,764 notes were included in the study. A total of 39% of patients were high-risk and 73% of these received a bone scan; of the 18% of patients who were low-risk, 10% received one. The CNN model outperformed the rule-based model (F-measure = 0.918 and 0.897, respectively). We demonstrate that a combination of both models could maximize precision or recall, based on the study question. CONCLUSION Using structured data, we accurately classified patients' cancer risk group, identified bone scan documentation with two NLP methods, and evaluated guideline adherence. Our pipeline can be used to provide concrete feedback to clinicians and guide treatment decisions.
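The statement that combining both models can favour precision or recall can be illustrated with a minimal sketch. The AND/OR combination rule used here is an assumption for illustration, since the abstract does not describe how the authors combined the models, and the labels are invented.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F-measure for boolean label lists."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def combine(rule_based, cnn, mode):
    """'and' keeps a positive only if both models agree (favours precision);
    'or' keeps it if either model fires (favours recall)."""
    if mode == "and":
        return [r and c for r, c in zip(rule_based, cnn)]
    if mode == "or":
        return [r or c for r, c in zip(rule_based, cnn)]
    raise ValueError("mode must be 'and' or 'or'")

# Hypothetical sentence-level labels: True = bone scan documented.
gold       = [True, True, True, False, False, True]
rule_based = [True, False, True, True, False, False]
cnn        = [True, True, False, False, False, False]

strict = precision_recall_f1(gold, combine(rule_based, cnn, "and"))
lenient = precision_recall_f1(gold, combine(rule_based, cnn, "or"))
```

On this toy data the strict combination makes no false-positive calls but misses most true mentions, while the lenient combination recovers more mentions at the cost of one false positive.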
Affiliation(s)
- Jean Coquet
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Selen Bozkurt
  - Department of Medicine, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, USA
- Kathleen M Kan
  - Department of Urology, Stanford University School of Medicine, Stanford, USA
- Michelle K Ferrari
  - Department of Urology, Stanford University School of Medicine, Stanford, USA
- Douglas W Blayney
  - Department of Medicine, Stanford University, Stanford, CA, USA; Stanford Cancer Institute, Stanford University School of Medicine, Stanford, USA
- James D Brooks
  - Department of Urology, Stanford University School of Medicine, Stanford, USA; Stanford Cancer Institute, Stanford University School of Medicine, Stanford, USA
- Tina Hernandez-Boussard
  - Department of Medicine, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, USA; Department of Surgery, Stanford University School of Medicine, Stanford, USA
20
Banerjee I, Li K, Seneviratne M, Ferrari M, Seto T, Brooks JD, Rubin DL, Hernandez-Boussard T. Weakly supervised natural language processing for assessing patient-centered outcome following prostate cancer treatment. JAMIA Open 2019; 2:150-159. [PMID: 31032481 PMCID: PMC6482003 DOI: 10.1093/jamiaopen/ooy057] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 08/14/2018] [Revised: 11/14/2018] [Accepted: 11/28/2018] [Indexed: 11/13/2022] Open
Abstract
Background The population-based assessment of patient-centered outcomes (PCOs) has been limited by the difficulty of collecting these data efficiently and accurately. Natural language processing (NLP) pipelines can determine whether a clinical note within an electronic medical record contains evidence on these data. We present and demonstrate the accuracy of an NLP pipeline that aims to assess the presence, absence, or risk discussion of two important PCOs following prostate cancer treatment: urinary incontinence (UI) and bowel dysfunction (BD). Methods We propose a weakly supervised NLP approach which annotates electronic medical record clinical notes without requiring manual chart review. A weighted function of neural word embedding was used to create a sentence-level vector representation of relevant expressions extracted from the clinical notes. Sentence vectors were used as input for a multinomial logistic model, with output being either presence, absence or risk discussion of UI/BD. The classifier was trained based on automated sentence annotation depending only on domain-specific dictionaries (weak supervision). Results The model achieved an average F1 score of 0.86 for the sentence-level, three-tier classification task (presence/absence/risk) in both UI and BD. The model also outperformed a pre-existing rule-based model for note-level annotation of UI by a significant margin. Conclusions We demonstrate a machine learning method to categorize clinical notes based on important PCOs that trains a classifier on sentence vector representations labeled with a domain-specific dictionary, which eliminates the need for manual engineering of linguistic rules or manual chart review for extracting the PCOs. The weakly supervised NLP pipeline showed promising sensitivity and specificity for identifying important PCOs in unstructured clinical text notes compared to rule-based algorithms.
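The sentence-level representation described in the methods above can be sketched as a weighted average of word vectors. This is only an illustration under stated assumptions: the abstract does not give the weighting function, so the weights here are arbitrary, and the two-dimensional embeddings are toy values rather than trained vectors.

```python
def sentence_vector(tokens, embeddings, weights):
    """Weighted average of word vectors; unknown tokens are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue
        w = weights.get(tok, 1.0)  # assumption: default weight of 1.0
        total = [t + w * x for t, x in zip(total, embeddings[tok])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [t / weight_sum for t in total]

# Toy 2-dimensional embeddings and weights, invented for illustration only.
embeddings = {"denies": [0.0, 1.0], "urinary": [1.0, 0.0], "incontinence": [1.0, 1.0]}
weights = {"denies": 2.0, "urinary": 1.0, "incontinence": 1.0}
vec = sentence_vector(["patient", "denies", "urinary", "incontinence"], embeddings, weights)
```

A vector like this would then be passed to the multinomial logistic model, with training labels coming from dictionary matches rather than manual chart review.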
Affiliation(s)
- Imon Banerjee
  - Department of Biomedical Data Science, Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
- Kevin Li
  - Stanford University School of Medicine, 291 Campus Drive, Stanford, California 94305-5479, USA
- Martin Seneviratne
  - Department of Biomedical Data Science, Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
  - Department of Biomedical Informatics, Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
- Michelle Ferrari
  - Department of Urology - Divisions, Stanford University School of Medicine, 875 Blake Wilbur, Stanford, California 94305-5479, USA
- Tina Seto
  - IRT Research Technology, Stanford University School of Medicine, Stanford, California 94305-5479, USA
- James D Brooks
  - Department of Urology - Divisions, Stanford University School of Medicine, 875 Blake Wilbur, Stanford, California 94305-5479, USA
- Daniel L Rubin
  - Department of Biomedical Data Science, Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
  - Department of Radiology, Stanford University School of Medicine, Stanford, California 94305-5479, USA
  - Department of Medicine (Biomedical Informatics), Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
- Tina Hernandez-Boussard
  - Department of Biomedical Data Science, Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
  - Department of Medicine (Biomedical Informatics), Stanford University School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, California 94305-5479, USA
  - Department of Surgery, Stanford University School of Medicine, 300 Pasteur Drive Stanford, California 94305-2200, USA
21
Seneviratne MG, Bozkurt S, Patel MI, Seto T, Brooks JD, Blayney DW, Kurian AW, Hernandez-Boussard T. Distribution of global health measures from routinely collected PROMIS surveys in patients with breast cancer or prostate cancer. Cancer 2019; 125:943-951. [PMID: 30512191 PMCID: PMC6403006 DOI: 10.1002/cncr.31895] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 08/13/2018] [Revised: 10/17/2018] [Accepted: 10/31/2018] [Indexed: 01/07/2023]
Abstract
BACKGROUND: The collection of patient-reported outcomes (PROs) is an emerging priority internationally, guiding clinical care, quality improvement projects and research studies. After the deployment of Patient-Reported Outcomes Measurement Information System (PROMIS) surveys in routine outpatient workflows at an academic cancer center, electronic health record data were used to evaluate survey completion rates and self-reported global health measures across 2 tumor types: breast and prostate cancer. METHODS: This study retrospectively analyzed 11,657 PROMIS surveys from patients with breast cancer and 4411 surveys from patients with prostate cancer, and it calculated survey completion rates and global physical health (GPH) and global mental health (GMH) scores between 2013 and 2018. RESULTS: A total of 36.6% of eligible patients with breast cancer and 23.7% of patients with prostate cancer completed at least 1 survey, with completion rates lower among black patients for both tumor types (P < .05). The mean T scores (calibrated to a general population mean of 50) for GPH were 48.4 ± 9 for breast cancer and 50.6 ± 9 for prostate cancer, and the GMH scores were 52.7 ± 8 and 52.1 ± 9, respectively. GPH and GMH were frequently lower among ethnic minorities, patients without private health insurance, and those with advanced disease. CONCLUSIONS: This analysis provides important baseline data on patient-reported global health in breast and prostate cancer. Demonstrating that PROs can be integrated into clinical workflows, this study shows that supportive efforts may be needed to improve PRO collection and global health endpoints in vulnerable populations.
Affiliation(s)
- Selen Bozkurt
- Department of Biomedical Informatics, Stanford University, CA
- Tina Seto
- Department of Biomedical Informatics, Stanford University, CA
- Allison W. Kurian
- Department of Medicine (Oncology), Stanford University, CA
- Department of Health Research and Policy, Stanford University, CA
22
Seneviratne MG, Kahn MG, Hernandez-Boussard T. Merging heterogeneous clinical data to enable knowledge discovery. Pacific Symposium on Biocomputing 2019; 24:439-443. [PMID: 30864344 PMCID: PMC6447393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
The vision of precision medicine relies on the integration of large-scale clinical, molecular and environmental datasets. Data integration may be thought of along two axes: data fusion across institutions, and data fusion across modalities. Cross-institutional data sharing that maintains semantic integrity hinges on the adoption of data standards and a push toward ontology-driven integration. The goal should be the creation of queryable data repositories spanning primary and tertiary care providers, disease registries, research organizations, etc., to produce rich longitudinal datasets. Cross-modality sharing involves the integration of multiple data streams, from structured EHR data (diagnosis codes, laboratory tests) to genomics, imaging, monitors and patient-generated data including wearable devices. This integration presents unique technical, semantic, and ethical challenges; however, recent work suggests that multi-modal clinical data can significantly improve the performance of phenotyping and prediction algorithms, powering knowledge discovery at the patient and population levels.
Affiliation(s)
- Martin G. Seneviratne
- Department of Biomedical Data Science, Stanford University, 1265 Welch Rd, Stanford, CA 94305, United States
- Michael G. Kahn
- Colorado Clinical and Translational Sciences Institute, Denver, CO 80045, United States
- Tina Hernandez-Boussard
- Department of Medicine, Biomedical Informatics, Stanford University, 1265 Welch Rd, Stanford, CA 94305, United States
23
Seneviratne MG, Banda JM, Brooks JD, Shah NH, Hernandez-Boussard TM. Identifying Cases of Metastatic Prostate Cancer Using Machine Learning on Electronic Health Records. AMIA Annual Symposium Proceedings 2018; 2018:1498-1504. [PMID: 30815195 PMCID: PMC6371284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Cancer stage is rarely captured in structured form in the electronic health record (EHR). We evaluate the performance of a classifier, trained on structured EHR data, in identifying prostate cancer patients with metastatic disease. Using EHR data for a cohort of 5,861 prostate cancer patients mapped to the Observational Health Data Sciences and Informatics (OHDSI) data model, we constructed feature vectors containing frequency counts of conditions, procedures, medications, observations and laboratory values. Staging information from the California Cancer Registry was used as the ground truth. For identifying patients with metastatic disease, a random forest model achieved a precision of 0.90 and a recall of 0.40 using data within 12 months of diagnosis, compared with a precision of 0.33 and a recall of 0.54 for an ICD code-based query. High-precision classifiers using hundreds of structured data elements significantly outperform ICD queries, and may assist in identifying cohorts for observational research or clinical trial matching.
Affiliation(s)
- Juan M Banda
- Department of Biomedical Informatics, Stanford School of Medicine, CA
- Nigam H Shah
- Department of Biomedical Informatics, Stanford School of Medicine, CA
24
Bozkurt S, Park JI, Kan KM, Ferrari M, Rubin DL, Brooks JD, Hernandez-Boussard T. An Automated Feature Engineering for Digital Rectal Examination Documentation using Natural Language Processing. AMIA Annual Symposium Proceedings 2018; 2018:288-294. [PMID: 30815067 PMCID: PMC6371344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Digital rectal examination (DRE) is considered a quality metric for prostate cancer care. However, much of the rich DRE-related information is documented as free text in clinical narratives. Therefore, we aimed to develop a natural language processing (NLP) pipeline for automatic documentation of DRE in clinical notes using a domain-specific dictionary created by clinical experts and an extended version of the same dictionary learned from clinical notes using distributional semantics algorithms. The proposed pipeline was compared to a baseline NLP algorithm, and the results of the proposed pipeline were found to be superior in terms of precision (0.95) and recall (0.90) for documentation of DRE. We believe the rule-based NLP pipeline enriched with terms learned from the whole corpus can provide accurate and efficient identification of this quality metric.
Affiliation(s)
- Selen Bozkurt
- Department of Medicine, Center for Biomedical Informatics Research, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Jung In Park
- Department of Medicine, Center for Biomedical Informatics Research, Stanford University, Stanford, CA
- Kathleen Mary Kan
- Department of Urology, Stanford University School of Medicine, Stanford, CA
- Michelle Ferrari
- Department of Urology, Stanford University School of Medicine, Stanford, CA
- Daniel L Rubin
- Department of Medicine, Center for Biomedical Informatics Research, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Department of Radiology, Stanford University School of Medicine, Stanford, CA
- James D Brooks
- Department of Urology, Stanford University School of Medicine, Stanford, CA
- Tina Hernandez-Boussard
- Department of Medicine, Center for Biomedical Informatics Research, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA