226
|
Chen Z, Zhang H, George T, Prosperi M, Guo Y, Braithwaite D, Shenkman E, Licht J, Bian J. Abstract PO-071: Simulation of colorectal cancer clinical trials using real-world data and machine learning. Clin Cancer Res 2021. [DOI: 10.1158/1557-3265.adi21-po-071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Objectives To explore the feasibility of using real-world data (RWD) with machine learning methods to simulate colorectal cancer (CRC) trials (i.e., 6 Phase III randomized clinical trials comparing other treatment regimens with FOLFIRI, an FDA-approved standard-of-care first-line chemotherapy treatment in patients with metastatic CRC). Methods We used RWD from the OneFlorida Clinical Research Consortium, a clinical research network contributing to the national PCORnet with longitudinal linked electronic health records of ~15 million Floridians. We used the study protocols in the original trials, including the eligibility criteria, to define the various study populations. We focused on patients’ safety outcomes in terms of the occurrence of severe adverse events (SAEs) after the treatments; we calculated SAE prevalence, mean SAEs per patient, and SAE event rates for each category defined in the CTCAE v5.0. We considered two scenarios: (1) only simulating the control arm (CA) (i.e., the FOLFIRI arm), and (2) simulating both the CA and experimental arm (EA) (e.g., Panitumumab + FOLFIRI) and calculating the relative risk of SAE between the 2 arms. Two sampling strategies were used to simulate the study population: random sampling and proportional sampling with gender and race. Among the 6 trials, only 2 had sufficient patients in OneFlorida for the two-arm simulations. We used propensity score matching (PSM) on baseline characteristics such as age, gender, race, and comorbidities to simulate the randomization process. In addition to the traditional logistic regression (LR) model, we considered machine learning (ML) models for PSM (such as neural networks) as LR-based PSM assumes linearity of the underlying variables. Each trial was simulated 1,000 times. Results Consistent with the existing literature, the mean SAE and SAE event rates were higher in all CAs simulated through RWD from OneFlorida.
The proportional sampling strategy provided estimates of SAE prevalence more comparable to rates reported by the original trials. In the two-arm simulations, no significant differences were observed in the matched case-control samples using LR or ML methods. As expected with patients treated in real-world settings, larger mean SAEs and SAE event rates (but similar SAE prevalence) were observed in the simulations compared with the original trials. The risk ratios of having SAE obtained from simulations comparing CA vs. EA were very close to the ratios calculated from the original trials. Conclusion Our study showed the feasibility of simulating cancer trials using RWD and obtained comparable estimates to the original trials in terms of patient safety outcomes. Despite more SAEs in RWD, ratios between CAs and EAs were similar to those of the previously published, rigorously conducted trials. Future in-depth investigations are warranted and shall consider state-of-the-art AI methods such as deep learning and causal AI methods to help tackle issues with using RWD for cancer trial simulation (e.g., data bias, high-dimensionality).
Citation Format: Zhaoyi Chen, Hansi Zhang, Thomas George, Mattia Prosperi, Yi Guo, Dejana Braithwaite, Elizabeth Shenkman, Jonathan Licht, Jiang Bian. Simulation of colorectal cancer clinical trials using real-world data and machine learning [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PO-071.
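The abstract above compares arms through the relative risk of SAE. As a minimal sketch of that arithmetic (not the authors' code), the following computes a risk ratio with a Wald 95% confidence interval on the log scale. The function name `relative_risk` and the counts in the usage note are hypothetical and are not taken from the study.

```python
import math


def relative_risk(events_a, n_a, events_b, n_b):
    """Risk ratio of arm A versus arm B with a Wald 95% CI on the log scale.

    events_* are counts of patients with at least one SAE and n_* are arm sizes.
    All names and inputs here are illustrative assumptions.
    """
    if events_a <= 0 or events_b <= 0:
        raise ValueError("both arms need at least one event for a log-scale CI")
    if events_a > n_a or events_b > n_b:
        raise ValueError("event counts cannot exceed arm sizes")
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent binomial proportions.
    se = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return rr, lower, upper
```

For example, `relative_risk(30, 100, 20, 100)` returns a risk ratio of 1.5 together with its interval bounds; in a repeated simulation one would presumably summarize such ratios across runs.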
|
227
|
Ghosh S, Bian J, Guo Y, Prosperi M. Deep propensity network using a sparse autoencoder for estimation of treatment effects. J Am Med Inform Assoc 2021; 28:1197-1206. [PMID: 33594415 DOI: 10.1093/jamia/ocaa346] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/27/2020] [Revised: 11/22/2020] [Accepted: 12/28/2020] [Indexed: 12/11/2022]
Abstract
OBJECTIVE Drawing causal estimates from observational data is problematic, because datasets often contain underlying bias (eg, discrimination in treatment assignment). To examine causal effects, it is important to evaluate what-if scenarios, the so-called "counterfactuals." We propose a novel deep learning architecture for propensity score matching and counterfactual prediction, the deep propensity network using a sparse autoencoder (DPN-SA), to tackle the problems of high dimensionality, nonlinear/nonparallel treatment assignment, and residual confounding when estimating treatment effects. MATERIALS AND METHODS We used 2 randomized prospective datasets: a semisynthetic one with nonlinear/nonparallel treatment selection bias and simulated counterfactual outcomes from the Infant Health and Development Program, and a real-world dataset from LaLonde's employment training program. We compared different configurations of the DPN-SA against logistic regression and LASSO as well as deep counterfactual networks with propensity dropout (DCN-PD). Models' performances were assessed in terms of average treatment effects, mean squared error in precision on effect's heterogeneity, and average treatment effect on the treated, over multiple training/test runs. RESULTS The DPN-SA outperformed logistic regression and LASSO by 36%-63%, and DCN-PD by 6%-10% across all datasets. All deep learning architectures yielded average treatment effects close to the true ones with low variance. Results were also robust to noise-injection and addition of correlated variables. Code is publicly available at https://github.com/Shantanu48114860/DPN-SAz. DISCUSSION AND CONCLUSION Deep sparse autoencoders are particularly suited for treatment effect estimation studies using electronic health records because they can handle high-dimensional covariate sets, large sample sizes, and complex heterogeneity in treatment assignments.
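Propensity score matching, which the abstract above builds on, pairs each treated unit with a control unit that has a similar estimated probability of treatment. The sketch below shows only a generic greedy 1:1 nearest-neighbor matching step with a caliper. It is not the DPN-SA architecture, it assumes the propensity scores were already estimated by some model, and the function name, caliper value, and identifiers are hypothetical.

```python
def greedy_match(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on precomputed propensity scores.

    treated and control map unit id -> propensity score (assumed inputs).
    Each control is used at most once; pairs further apart than the
    caliper are discarded. Returns a list of (treated_id, control_id).
    """
    available = dict(control)  # copy so the caller's dict is not mutated
    pairs = []
    # Process treated units in ascending score order for determinism.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs
```

With `treated = {"t1": 0.30, "t2": 0.70}` and `control = {"c1": 0.32, "c2": 0.68, "c3": 0.10}`, the function pairs `t1` with `c1` and `t2` with `c2`, leaving `c3` unmatched. Greedy matching is order-dependent; optimal matching is a common alternative when that matters.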
|
228
|
Prosperi M, Guo Y, Bian J. Bagged random causal networks for interventional queries on observational biomedical datasets. J Biomed Inform 2021; 115:103689. [PMID: 33548542 DOI: 10.1016/j.jbi.2021.103689] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/21/2020] [Revised: 12/30/2020] [Accepted: 01/23/2021] [Indexed: 11/30/2022]
Abstract
Learning causal effects from observational data, e.g. estimating the effect of a treatment on survival by data-mining electronic health records (EHRs), can be biased due to unmeasured confounders, mediators, and colliders. When the causal dependencies among features/covariates are expressed in the form of a directed acyclic graph, using do-calculus it is possible to identify one or more adjustment sets for eliminating the bias on a given causal query under certain assumptions. However, prior knowledge of the causal structure might be only partial; algorithms for causal structure discovery often provide ambiguous solutions, and their computational complexity becomes practically intractable when the feature sets grow large. We hypothesize that the estimation of the true causal effect of a causal query on an outcome can be approximated as an ensemble of lower complexity estimators, namely bagged random causal networks. A bagged random causal network is an ensemble of subnetworks constructed by sampling the feature subspaces (with the query, the outcome, and a random number of other features), drawing conditional dependencies among the features, and inferring the corresponding adjustment sets. The causal effect can then be estimated by any regression function of the outcome by the query paired with the adjustment sets. Through simulations and a real-world clinical dataset (class III malocclusion data), we show that the bagged estimator is, in most cases, consistent with the true causal effect if the structure is known, has a good variance/bias trade-off when the structure is unknown (estimated using heuristics), has lower computational complexity than learning a full network, and outperforms boosted regression. In conclusion, the bagged random causal network is well-suited to estimate query-target causal effects from observational studies on EHR and other high-dimensional biomedical databases.
|
229
|
Macieira TGR, Chianca TCM, Smith MB, Yao Y, Bian J, Wilkie DJ, Dunn Lopez K, Keenan GM. Secondary use of standardized nursing care data for advancing nursing science and practice: a systematic review. J Am Med Inform Assoc 2019; 26:1401-1411. [PMID: 31188439 DOI: 10.1093/jamia/ocz086] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 02/11/2019] [Revised: 05/04/2019] [Accepted: 05/09/2019] [Indexed: 11/14/2022]
Abstract
OBJECTIVE The study sought to present the findings of a systematic review of studies involving secondary analyses of data coded with standardized nursing terminologies (SNTs) retrieved from electronic health records (EHRs). MATERIALS AND METHODS We identified studies that performed secondary analysis of SNT-coded nursing EHR data from PubMed, CINAHL, and Google Scholar. We screened 2570 unique records and identified 44 articles of interest. We extracted research questions, nursing terminologies, sample characteristics, variables, and statistical techniques used from these articles. An adapted STROBE (Strengthening The Reporting of OBservational Studies in Epidemiology) Statement checklist for observational studies was used for reproducibility assessment. RESULTS Forty-four articles were identified. Their study foci were grouped into 3 categories: (1) potential uses of SNT-coded nursing data or challenges associated with this type of data (feasibility of standardizing nursing data), (2) analysis of SNT-coded nursing data to describe the characteristics of nursing care (characterization of nursing care), and (3) analysis of SNT-coded nursing data to understand the impact or effectiveness of nursing care (impact of nursing care). The analytical techniques varied including bivariate analysis, data mining, and predictive modeling. DISCUSSION SNT-coded nursing data extracted from EHRs is useful in characterizing nursing practice and offers the potential for demonstrating its impact on patient outcomes. CONCLUSIONS Our study provides evidence of the value of SNT-coded nursing data in EHRs. Future studies are needed to identify additional useful methods of analyzing SNT-coded nursing data and to combine nursing data with other data elements in EHRs to fully characterize the patient's health care experience.
|
230
|
Duan R, Chen Z, Tong J, Luo C, Lyu T, Tao C, Maraganore D, Bian J, Chen Y. Leverage Real-world Longitudinal Data in Large Clinical Research Networks for Alzheimer's Disease and Related Dementia (ADRD). AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:393-401. [PMID: 33936412 PMCID: PMC8075520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
With vast amounts of patients' medical information, electronic health records (EHRs) are becoming one of the most important data sources in biomedical and health care research. Effectively integrating data from multiple clinical sites can help provide more generalized real-world evidence that is clinically meaningful. To analyze the clinical data from multiple sites, distributed algorithms are developed to protect patient privacy without sharing individual-level medical information. In this paper, we applied the One-shot Distributed Algorithm for Cox proportional hazard model (ODAC) to the longitudinal data from the OneFlorida Clinical Research Consortium to demonstrate the feasibility of implementing the distributed algorithms in large research networks. We studied the associations between the clinical risk factors and Alzheimer's disease and related dementia (ADRD) onsets to advance clinical research on our understanding of the complex risk factors of ADRD and ultimately improve the care of ADRD patients.
|
231
|
Guo Y, He X, Lyu T, Zhang H, Wu Y, Yang X, Chen Z, Markham MJ, Modave F, Xie M, Hogan W, Harle CA, Shenkman EA, Bian J. Developing and Validating a Computable Phenotype for the Identification of Transgender and Gender Nonconforming Individuals and Subgroups. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:514-523. [PMID: 33936425 PMCID: PMC8075543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Transgender and gender nonconforming (TGNC) individuals face significant marginalization, stigma, and discrimination. Under-reporting of TGNC individuals is common since they are often unwilling to self-identify. Meanwhile, the rapid adoption of electronic health record (EHR) systems has made large-scale, longitudinal real-world clinical data available to research and provided a unique opportunity to identify TGNC individuals using their EHRs, contributing to a promising routine health surveillance approach. Built upon existing work, we developed and validated a computable phenotype (CP) algorithm for identifying TGNC individuals and their natal sex (i.e., male-to-female or female-to-male) using both structured EHR data and unstructured clinical notes. Our CP algorithm achieved a 0.955 F1-score on the training data and a perfect F1-score on the independent testing data. Consistent with the literature, we observed an increasing percentage of TGNC individuals and a disproportionate burden of adverse health outcomes, especially sexually transmitted infections and mental health distress, in this population.
|
232
|
Li Q, Guo Y, He Z, Zhang H, George TJ, Bian J. Using Real-World Data to Rationalize Clinical Trials Eligibility Criteria Design: A Case Study of Alzheimer's Disease Trials. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:717-726. [PMID: 33936446 PMCID: PMC8075542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Low trial generalizability is a concern. The Food and Drug Administration has issued guidance on broadening trial eligibility criteria to enroll underrepresented populations. However, investigators are hesitant to do so because of concerns over patient safety. There is a lack of methods to rationalize criteria design. In this study, we used data from a large research network to assess how adjustments of eligibility criteria can jointly affect generalizability and patient safety (i.e., the number of serious adverse events [SAEs]). We first built a model to predict the number of SAEs. Then, leveraging an a priori generalizability assessment algorithm, we assessed the changes in the number of predicted SAEs and the generalizability score, simulating the process of dropping exclusion criteria and increasing the upper limit of continuous eligibility criteria. Using donepezil trials as a case study, we argued that broadening of eligibility criteria should balance potential increases in SAEs against gains in generalizability.
|
233
|
Tong J, Chen Z, Duan R, Lo-Ciganic WH, Lyu T, Tao C, Merkel PA, Kranzler HR, Bian J, Chen Y. Identifying Clinical Risk Factors for Opioid Use Disorder using a Distributed Algorithm to Combine Real-World Data from a Large Clinical Data Research Network. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:1220-1229. [PMID: 33936498 PMCID: PMC8075517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Because they contain detailed individual-level data on various patient characteristics including their medical conditions and treatment histories, electronic health record (EHR) systems have been widely adopted as an efficient source for health research. Compared to data from a single health system, real-world data (RWD) from multiple clinical sites provide a larger and more generalizable population for accurate estimation, leading to better decision making for health care. However, due to concerns over protecting patient privacy, it is challenging to share individual patient-level data across sites in practice. To tackle this issue, many distributed algorithms have been developed to transfer summary-level statistics to derive accurate estimates. Nevertheless, many of these algorithms require multiple rounds of communication to exchange intermediate results across different sites. Among them, the One-shot Distributed Algorithm for Logistic regression (termed ODAL) was developed to reduce communication overhead while protecting patient privacy. In this paper, we applied the ODAL algorithm to RWD from a large clinical data research network, the OneFlorida Clinical Research Consortium, and estimated the associations between risk factors and the diagnosis of opioid use disorder (OUD) among individuals who received at least one opioid prescription. The ODAL algorithm provided consistent findings of the associated risk factors and yielded better estimates than meta-analysis.
|
234
|
Abe K, Bronner C, Hayato Y, Ikeda M, Imaizumi S, Ito H, Kameda J, Kataoka Y, Miura M, Moriyama S, Nagao Y, Nakahata M, Nakajima Y, Nakayama S, Okada T, Okamoto K, Orii A, Pronost G, Sekiya H, Shiozawa M, Sonoda Y, Suzuki Y, Takeda A, Takemoto Y, Takenaka A, Tanaka H, Yano T, Akutsu R, Han S, Kajita T, Okumura K, Tashiro T, Wang R, Xia J, Bravo-Berguño D, Labarga L, Marti L, Zaldivar B, Blaszczyk F, Kearns E, Gustafson J, Raaf J, Stone J, Wan L, Wester T, Bian J, Griskevich N, Kropp W, Locke S, Mine S, Smy M, Sobel H, Takhistov V, Weatherly P, Hill J, Kim J, Lim I, Park R, Bodur B, Scholberg K, Walter C, Coffani A, Drapier O, El Hedri S, Giampaolo A, Gonin M, Mueller T, Paganini P, Quilain B, Ishizuka T, Nakamura T, Jang J, Learned J, Anthony L, Sztuc A, Uchida Y, Berardi V, Catanesi M, Radicioni E, Calabria N, Machado L, De Rosa G, Collazuol G, Iacob F, Lamoureux M, Ospina N, Ludovici L, Nishimura Y, Cao S, Friend M, Hasegawa T, Ishida T, Kobayashi T, Matsubara T, Nakadaira T, Jakkapu M, Nakamura K, Oyama Y, Sakashita K, Sekiguchi T, Tsukamoto T, Nakano Y, Shiozawa T, Suzuki A, Takeuchi Y, Yamamoto S, Ali A, Ashida Y, Feng J, Hirota S, Ichikawa A, Kikawa T, Mori M, Nakaya T, Wendell R, Yasutome K, Fernandez P, McCauley N, Mehta P, Pritchard A, Tsui K, Fukuda Y, Itow Y, Menjo H, Niwa T, Sato K, Tsukada M, Mijakowski P, Posiadala-Zezula M, Jung C, Vilela C, Wilking M, Yanagisawa C, Harada M, Hagiwara K, Horai T, Ishino H, Ito S, Koshio Y, Ma W, Piplani N, Sakai S, Kuno Y, Barr G, Barrow D, Cook L, Goldsack A, Samani S, Simpson C, Wark D, Nova F, Boschi T, Di Lodovico F, Molina Sedgwick S, Taani M, Zsoldos S, Yang J, Jenkins S, McElwee J, Thiesse M, Thompson L, Malek M, Stone O, Okazawa H, Kim S, Yu I, Nishijima K, Koshiba M, Ogawa N, Iwamoto K, Yokoyama M, Martens K, Vagins M, Kuze M, Izumiyama S, Tanaka M, Yoshida T, Inomoto M, Ishitsuka M, Matsumoto R, Ohta K, Shinoki M, Martin J, Tanaka H, Towstego T, Hartz M, Konaka A, de Perio P, Prouse N, Pointon B, Chen S, Xu 
B, Richards B, Jamieson B, Walker J, Minamino A, Okamoto K, Pintaudi G, Sasaki R. Neutron-antineutron oscillation search using a 0.37 megaton-years exposure of Super-Kamiokande. Phys Rev D 2021. [DOI: 10.1103/physrevd.103.012008] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/07/2022]
|
235
|
Guo Z, Qiu C, Mecca C, Zhang Y, Bian J, Wang Y, Wu X, Wang T, Su W, Li X, Zhang W, Chen B, Xiang H. Elevated lymphotoxin-α (TNFβ) is associated with intervertebral disc degeneration. BMC Musculoskelet Disord 2021; 22:77. [PMID: 33441130 PMCID: PMC7807514 DOI: 10.1186/s12891-020-03934-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/05/2020] [Accepted: 12/28/2020] [Indexed: 11/25/2022]
Abstract
Background Intervertebral disc degeneration (IVDD) is a primary cause of degenerative disc diseases; however, the mechanisms underlying the degeneration remain unclear. The immunoinflammatory response plays an important role in IVDD progression. The inflammatory cytokine lymphotoxin-α (LTα), formerly known as TNFβ, is associated with various pathological conditions, while its role in the pathogenesis of IVDD remains elusive. Methods Real-time quantitative polymerase chain reaction (RT-qPCR), Western blotting (WB), and enzyme-linked immunosorbent assays were used to assess the levels of LTα in human nucleus pulposus (NP) tissues between degeneration and control groups. The plasma concentrations of LTα and C-reactive protein (CRP) were compared between healthy and IVDD patients. Rat primary NP cells were cultured and identified via immunofluorescence. Methyl-thiazolyl-tetrazolium assays and flow cytometry were used to evaluate the effects of LTα on rat NP cell viability. After NP cells were treated with LTα, degeneration-related molecules (Caspase-3, Caspase-1, matrix metalloproteinase (MMP) -3, aggrecan and type II collagen) were measured via RT-qPCR and WB. Results The levels of both the mRNA and protein of LTα in human degenerated NP tissue significantly increased. Plasma LTα and CRP did not differ between healthy controls and IVDD patients. Rat primary NP cells were cultured, and the purity of primary NP cells was > 90%. Cell experiments showed inversely proportional relationships among the LTα dose, treatment time, and cell viability. The optimal conditions (dose and time) for LTα treatment to induce rat NP cell degeneration were 5 μg/ml and 48 ~ 72 h. The apoptosis rate and the levels of Caspase-3, Caspase-1, and MMP-3 significantly increased after LTα treatment, while the levels of type II collagen and aggrecan were decreased, and the protein expression levels were consistent with their mRNA expression levels. 
Conclusions This study demonstrated that elevated LTα is closely associated with IVDD and that LTα may induce NP cell apoptosis and reduce important extracellular matrix (ECM) proteins, which cause adverse effects on IVDD progress. Moreover, the optimal conditions for LTα treatment to induce NP cell degeneration were determined. Supplementary Information The online version contains supplementary material available at 10.1186/s12891-020-03934-7.
|
236
|
Staras SAS, Richardson E, Merlo LJ, Bian J, Thompson LA, Krieger JL, Gurka MJ, Sanders AH, Shenkman EA. A feasibility trial of parent HPV vaccine reminders and phone-based motivational interviewing. BMC Public Health 2021; 21:109. [PMID: 33422047 PMCID: PMC7797089 DOI: 10.1186/s12889-020-10132-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/04/2020] [Accepted: 12/25/2020] [Indexed: 11/28/2022]
Abstract
Background We assessed the feasibility and acceptability of a sequential approach of parent-targeted HPV vaccine reminders and phone-based Motivational Interviewing (MI). Methods In 2016, we selected all 11- to 12-year-old boys and girls seen in one clinic whose vaccine records did not include the HPV vaccine (n=286). By gender, we individually randomized parents of adolescents to an interactive text message (74 girls and 45 boys), postcard reminder (46 boys and no girls because of previously demonstrated efficacy), or standard care group (75 girls and 46 boys). Reminders were sent with medical director permission and a HIPAA waiver. Two months after reminders, among the adolescents whose vaccine records still did not include the HPV vaccine, we selected a gender-stratified random sample of 20 parents for phone-based MI. We assessed the percentage of deliverable messages, the percentage of parents responding to the interactive text message, parent acceptability of receiving a text message, and MI parent responsiveness and interviewer competence (MI Treatment Integrity Coding system). Results Nearly all messages were deliverable (98% of postcards and 74% of text messages). Six of the 88 parents (7%) receiving text messages scheduled an appointment through our interactive system. The acceptability survey response rate was 37% (38/102). Respondents were favorable toward vaccine reminders for all parents (82%). Among 20 sampled parents, 17 were reached by phone, of whom 7 completed MI, 4 had or were getting the HPV vaccine for their child, and 5 expressed disinterest. Across the 7 MI calls, the interviewer was rated 100% MI adherent and scored an average 4.19 rating for Global Spirit. Conclusion Without providing explicit consent to receive vaccine-related messages, parents nonetheless found postcards and interactive text messages acceptable. Centralizing MI to phone calls with trained staff was acceptable to parents and resulted in highly MI-adherent interviews.
Supplementary Information The online version contains supplementary material available at 10.1186/s12889-020-10132-6.
|
237
|
Yu Z, Yang X, Dang C, Wu S, Adekkanattu P, Pathak J, George TJ, Hogan WR, Guo Y, Bian J, Wu Y. A Study of Social and Behavioral Determinants of Health in Lung Cancer Patients Using Transformers-based Natural Language Processing Models. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2021:1225-1233. [PMID: 35309014 PMCID: PMC8861705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/14/2023]
Abstract
Social and behavioral determinants of health (SBDoH) have important roles in shaping people's health. In clinical research studies, especially comparative effectiveness studies, failure to adjust for SBDoH factors will potentially cause confounding issues and misclassification errors in both statistical analyses and machine learning-based models. However, there are limited studies to examine SBDoH factors in clinical outcomes due to the lack of structured SBDoH information in current electronic health record (EHR) systems, while much of the SBDoH information is documented in clinical narratives. Natural language processing (NLP) is thus the key technology to extract such information from unstructured clinical text. However, there is not a mature clinical NLP system focusing on SBDoH. In this study, we examined two state-of-the-art transformer-based NLP models, BERT and RoBERTa, to extract SBDoH concepts from clinical narratives, applied the best performing model to extract SBDoH concepts on a lung cancer screening patient cohort, and examined the difference in SBDoH information between NLP extracted results and structured EHRs (SBDoH information captured in standard vocabularies such as the International Classification of Diseases codes). The experimental results show that the BERT-based NLP model achieved the best strict/lenient F1-score of 0.8791 and 0.8999, respectively. The comparison between NLP extracted SBDoH information and structured EHRs in the lung cancer patient cohort of 864 patients with 161,933 various types of clinical notes showed that much more detailed information about smoking, education, and employment was only captured in clinical narratives and that it is necessary to use both clinical narratives and structured EHRs to construct a more complete picture of patients' SBDoH factors.
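The abstract above reports strict and lenient F1-scores without defining them. A minimal sketch under a commonly used convention, which is an assumption here and not confirmed by the source: a strict match requires identical span boundaries and label, while a lenient match requires only overlapping spans with the same label. Spans are hypothetical `(start, end, label)` tuples with half-open character offsets.

```python
def span_f1(gold, pred, lenient=False):
    """F1 over entity spans given as (start, end, label) tuples.

    Strict: boundaries and label must be identical.
    Lenient: spans must overlap and carry the same label.
    Both definitions are assumed conventions for illustration.
    """
    def is_match(g, p):
        if g[2] != p[2]:
            return False
        if lenient:
            return g[0] < p[1] and p[0] < g[1]  # half-open overlap test
        return g[0] == p[0] and g[1] == p[1]

    matched_pred = sum(any(is_match(g, p) for g in gold) for p in pred)
    matched_gold = sum(any(is_match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With gold spans `[(0, 5, "smoking"), (10, 20, "education")]` and predictions `[(0, 5, "smoking"), (12, 20, "education")]`, the strict score is 0.5 because the second prediction has a shifted start offset, while the lenient score is 1.0.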
|
238
|
Bishnoi R, Xie Z, Shah C, Bian J, Murthy HS, Wingard JR, Farhadfar N. Real-world experience of carfilzomib-associated cardiovascular adverse events: SEER-Medicare data set analysis. Cancer Med 2021; 10:70-78. [PMID: 33169938 PMCID: PMC7826471 DOI: 10.1002/cam4.3568] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 08/01/2020] [Revised: 09/23/2020] [Accepted: 10/05/2020] [Indexed: 01/08/2023]
Abstract
Carfilzomib was approved for the treatment of multiple myeloma in 2012, and since then there have been concerns about cardiovascular toxicity from its use. With this study, we aim to further examine the hazards and underlying risk factors for cardiovascular adverse events associated with carfilzomib. This study was conducted using the Surveillance, Epidemiology, and End Results (SEER)-Medicare data set of multiple myeloma from 2001 to 2015. Data were analyzed for hazard ratios of cardiovascular adverse events between carfilzomib users and nonusers. We identified 7330 patients with multiple myeloma, of whom 815 were carfilzomib users. Carfilzomib users had a statistically significant hazard ratio of 1.41 with p < 0.0001 for all cardiovascular adverse events as compared to nonusers. Carfilzomib use was significantly associated with increased risk of heart failure (HR 1.47, p = 0.0002), ischemic heart disease (HR 1.45, p = 0.0002), and hypertension (HR 3.33, p < 0.0001), whereas there was no association between carfilzomib use and cardiac conduction disorders (arrhythmia and heart blocks). Carfilzomib users were at higher risk of new-onset edema (HR 5.09, p < 0.0001), syncope (HR 4.27, p < 0.0001), dyspnea (HR 1.33, p < 0.0001), and chest pain (HR 1.18, p < 0.0001) as compared to carfilzomib nonusers. Age above 75 years, preexisting cardiovascular disease, obesity, and a twice-a-week carfilzomib schedule were significant risk factors associated with cardiovascular adverse events in carfilzomib users. The median time of onset for all cardiovascular adverse events was 3.1 months. This study has identified a significantly higher likelihood of cardiovascular adverse events in elderly Medicare patients receiving carfilzomib.
|
239
|
Pan J, Luo X, Bian J, Shao T, Li C, Zhao T, Zhang S, Zhou F, Wang G. Identification of Genomic Islands in Synechococcus sp. WH8102 Using Genomic Barcode and Whole-Genome Microarray Analysis. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200121160615] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
Synechococcus sp. WH8102 is one of the most abundant photosynthetic organisms in many ocean regions.
Objective:
The aim of this study is to identify genomic islands (GIs) in Synechococcus sp. WH8102 with integrated methods.
Methods:
We have applied genomic barcode to identify the GIs in Synechococcus sp. WH8102, which could make genomic regions of different origins visually apparent. The gene expression data of the predicted GIs was analyzed through microarray data which was collected for functional analysis of the relevant genes.
Results:
Seven GIs were identified in Synechococcus sp. WH8102. Most of them are involved in cell surface modification, photosynthesis and drug resistance. In addition, our analysis also revealed the functions of these GIs, which could be used for in-depth study on the evolution of this strain.
Conclusion:
Genomic barcodes provide a comprehensive and intuitive view of the target genome. They can be used to understand the intrinsic characteristics of the whole genome and to identify GIs or other similar elements.
|
240
|
Guo Y, Chen Z, Xu K, George TJ, Wu Y, Hogan W, Shenkman EA, Bian J. International Classification of Diseases, Tenth Revision, Clinical Modification social determinants of health codes are poorly used in electronic health records. Medicine (Baltimore) 2020; 99:e23818. [PMID: 33350768 PMCID: PMC7769291 DOI: 10.1097/md.0000000000023818] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Accepted: 11/19/2020] [Indexed: 11/26/2022] Open
Abstract
There have been increasing calls for clinicians to document social determinants of health (SDOH) in electronic health records (EHRs). One potential source of SDOH in the EHRs is in the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) Z codes (Z55-Z65). In February 2018, ICD-10-CM Official Guidelines for Coding and Reporting approved that all clinicians, not just the physicians, involved in the care of a patient can document SDOH using these Z codes. To examine the utilization rate of the ICD-10-CM Z codes using data from a large network of EHRs. We conducted a retrospective analysis of EHR data between 2015 and 2018 in the OneFlorida Clinical Research Consortium, 1 of the 13 Clinical Data Research Networks funded by Patient-Centered Outcomes Research Institute. We calculated the Z code utilization rate at both the encounter and patient levels. We found a low rate of utilization for these Z codes (270.61 per 100,000 at the encounter level and 2.03% at the patient level). We also found that the rate of utilization for these Z codes increased (from 255.62 to 292.79 per 100,000) since the official approval of Z code reporting from all clinicians by the American Hospital Association Coding Clinic and ICD-10-CM Official Guidelines for Coding and Reporting became effective in February 2018. The SDOH Z codes are rarely used by clinicians. Providing clear guidelines and incentives for documenting the Z codes can promote their use in EHRs. Improvements in the EHR systems are probably needed to better document SDOH.
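The encounter-level and patient-level utilization rates described in this abstract can be illustrated with a short sketch. This is not the authors' code: the record layout (a patient identifier paired with a list of diagnosis codes per encounter) and the function names `is_sdoh_code` and `utilization_rates` are assumptions made for illustration. Only the Z55-Z65 code range comes from the abstract.

```python
import re

# SDOH Z codes span Z55-Z65 (range taken from the abstract).
SDOH_PATTERN = re.compile(r"^Z(5[5-9]|6[0-5])")


def is_sdoh_code(code):
    """Return True if an ICD-10-CM code falls in the Z55-Z65 range."""
    return bool(SDOH_PATTERN.match(code.strip().upper()))


def utilization_rates(encounters):
    """Compute Z code utilization at the encounter and patient levels.

    encounters: iterable of (patient_id, [diagnosis codes]) tuples.
    This layout is hypothetical, not the OneFlorida schema.
    Returns (encounters with a Z code per 100,000 encounters,
             percent of patients with at least one Z code).
    """
    total_encounters = 0
    sdoh_encounters = 0
    patients = set()
    sdoh_patients = set()
    for patient_id, codes in encounters:
        total_encounters += 1
        patients.add(patient_id)
        if any(is_sdoh_code(c) for c in codes):
            sdoh_encounters += 1
            sdoh_patients.add(patient_id)
    if total_encounters == 0:
        raise ValueError("no encounters supplied")
    per_100k = 100_000 * sdoh_encounters / total_encounters
    patient_pct = 100 * len(sdoh_patients) / len(patients)
    return per_100k, patient_pct


# Toy data, invented for illustration only.
toy = [
    ("p1", ["Z59.0", "I10"]),
    ("p1", ["E11.9"]),
    ("p2", ["J45.909"]),
    ("p3", ["Z55.9"]),
]
encounter_rate, patient_rate = utilization_rates(toy)
```

The two rates use different denominators (encounters versus distinct patients), which is why the abstract can report a small per-encounter rate alongside a somewhat larger patient-level percentage.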
|
241
|
Chen Z, Liu X, Hogan W, Shenkman E, Bian J. Applications of artificial intelligence in drug development using real-world data. Drug Discov Today 2020; 26:1256-1264. [PMID: 33358699 DOI: 10.1016/j.drudis.2020.12.013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2020] [Revised: 11/21/2020] [Accepted: 12/16/2020] [Indexed: 01/12/2023]
Abstract
The US Food and Drug Administration (FDA) has been actively promoting the use of real-world data (RWD) in drug development. RWD can generate important real-world evidence reflecting the real-world clinical environment where the treatments are used. Meanwhile, artificial intelligence (AI), especially machine- and deep-learning (ML/DL) methods, has been increasingly used across many stages of the drug development process. Advancements in AI have also provided new strategies to analyze large, multidimensional RWD. Thus, we conducted a rapid review of articles from the past 20 years to provide an overview of the drug development studies that use both AI and RWD. We found that the most popular applications were adverse event detection, trial recruitment, and drug repurposing. Here, we also discuss current research gaps and future opportunities.
|
242
|
Yang X, Zhang H, He X, Bian J, Wu Y. Extracting Family History of Patients From Clinical Narratives: Exploring an End-to-End Solution With Deep Learning Models. JMIR Med Inform 2020; 8:e22982. [PMID: 33320104 PMCID: PMC7772072 DOI: 10.2196/22982] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2020] [Revised: 10/05/2020] [Accepted: 11/20/2020] [Indexed: 12/16/2022] Open
Abstract
Background Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in structured databases but is often documented in clinical narratives. Natural language processing (NLP) is the key technology to extract patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction. Objective This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand. Methods We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT) and developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies to use BERT output representations for relation identification. Results Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After the challenge, we further explored new transformer-based models and improved the performance on both subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved a performance comparable to the best system (0.6810) reported in the challenge. Conclusions This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.
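The micro-averaged F1 score reported in this abstract pools true positives, false positives, and false negatives across all labels before computing precision and recall. A minimal sketch follows; the label names and counts are invented for illustration and are not the n2c2 data, and `micro_f1` is a hypothetical helper rather than the challenge's official scorer.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-label (tp, fp, fn) counts.

    counts: dict mapping label -> (true positives, false positives,
    false negatives). Counts are pooled over labels first, so frequent
    labels weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two concept types.
example_counts = {
    "family_member": (8, 2, 2),
    "observation": (2, 0, 3),
}
score = micro_f1(example_counts)  # pooled tp=10, fp=2, fn=5
```

With the pooled counts above, precision is 10/12 and recall is 10/15, so the score equals 20/27.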
|
243
|
He Z, Erdengasileng A, Luo X, Xing A, Charness N, Bian J. How the clinical research community responded to the COVID-19 pandemic: An analysis of the COVID-19 clinical studies in ClinicalTrials.gov. medRxiv 2020:2020.09.16.20195552. [PMID: 32995807 PMCID: PMC7523146 DOI: 10.1101/2020.09.16.20195552] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
OBJECTIVE The novel coronavirus disease (COVID-19) broke out in December 2019 and is now a global pandemic. In the past few months, a large number of clinical studies have been initiated worldwide to find effective therapeutics, vaccines, and preventive strategies for COVID-19. In this study, we aim to understand the landscape of COVID-19 clinical research and identify gaps such as the lack of population representativeness and issues that may cause recruitment difficulty. MATERIALS AND METHODS We analyzed 3,765 COVID-19 studies registered in the largest public registry, ClinicalTrials.gov, leveraging natural language processing and using descriptive, association, and clustering analyses. We first characterized COVID-19 studies by study features such as phase and tested intervention. We then analyzed their eligibility criteria in depth to understand whether these studies: (1) considered the reported underlying health conditions that may lead to severe illnesses, and (2) excluded older adults, either explicitly or implicitly, which may reduce the generalizability of these studies to the older adult population. RESULTS Most trials did not have an upper age limit and did not exclude patients with common chronic conditions such as hypertension and diabetes that are more prevalent in older adults. However, known risk factors that may lead to severe illnesses have not been adequately considered. CONCLUSIONS A careful examination of existing COVID-19 studies can inform future COVID-19 trial design towards balanced internal validity and generalizability.
|
244
|
He Z, Tao C, Bian J, Zhang R. Selected articles from the Fourth International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). BMC Med Inform Decis Mak 2020; 20:315. [PMID: 33317524 PMCID: PMC7734704 DOI: 10.1186/s12911-020-01292-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
In this introduction, we first summarize the Fourth International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019) held on October 26, 2019 in conjunction with the 18th International Semantic Web Conference (ISWC 2019) in Auckland, New Zealand, and then briefly introduce seven research articles included in this supplement issue, covering the topics on Knowledge Graph, Ontology-Powered Analytics, and Deep Learning.
|
245
|
Zhang H, Guo Y, Prosperi M, Bian J. An ontology-based documentation of data discovery and integration process in cancer outcomes research. BMC Med Inform Decis Mak 2020; 20:292. [PMID: 33317497 PMCID: PMC7734720 DOI: 10.1186/s12911-020-01270-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2020] [Accepted: 09/17/2020] [Indexed: 01/24/2023] Open
Abstract
Background To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion Our ontology-based reporting guideline addresses some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) a standardized, ontology-powered documentation of the data selection and integration processes, and thus a way to share mIDA study reports among researchers.
|
246
|
Guo Y, Wheldon CW, Shao H, Pepine CJ, Handberg EM, Shenkman EA, Bian J. Statin Use for Atherosclerotic Cardiovascular Disease Prevention Among Sexual Minority Adults. J Am Heart Assoc 2020; 9:e018233. [PMID: 33317368 PMCID: PMC7955377 DOI: 10.1161/jaha.120.018233] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Background Sexual minority, or lesbian, gay, and bisexual (LGB), individuals are at increased risk for cardiovascular disease attributable to elevated rates of health risk factors. However, although there is clear evidence that statin use can prevent cardiovascular disease in certain adult populations, no studies have examined how statins are being used among the LGB population. This study aimed to examine the prevalence and predictors of statin use among LGB and non‐LGB individuals using Facebook‐delivered online surveys. Methods and Results We conducted a cross‐sectional online survey about statin use in adults ≥40 years of age between September and December 2019 using Facebook advertising (n=1531). We calculated the prevalence of statin use by age, sexual orientation, and statin benefit populations. We used multivariable logistic regression to examine whether statin use differed by sexual orientation, adjusting for covariates. We observed a significantly lower rate of statin use in the LGB versus non‐LGB respondents (20.8% versus 43.8%; P<0.001) in the primary prevention population. However, the prevalence of statin use was not statistically different in the LGB versus non‐LGB respondents in the secondary prevention population. Adjusting for the covariates, the LGB participants were less likely to use statins than the non‐LGB respondents in the primary prevention population (odds ratio, 0.37; 95% CI, 0.19–0.70). Conclusions Our results are the first to emphasize the urgent need for tailored, evidence‐based cardiovascular disease prevention programs that aim to promote statin use, and thus healthy aging, in the LGB population.
|
247
|
Bishnoi R, Shah C, Blaes A, Bian J, Hong YR. Cardiovascular toxicity in patients treated with immunotherapy for metastatic non-small cell lung cancer: A SEER-medicare study. Lung Cancer 2020; 150:172-177. [DOI: 10.1016/j.lungcan.2020.10.017] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Revised: 10/19/2020] [Accepted: 10/23/2020] [Indexed: 12/11/2022]
|
248
|
Huo J, Hong YR, Turner K, Diaby V, Chen C, Bian J, Grewal R, Wilkie DJ. Timing, Costs, and Survival Outcome of Specialty Palliative Care in Medicare Beneficiaries With Metastatic Non–Small-Cell Lung Cancer. JCO Oncol Pract 2020; 16:e1532-e1542. [DOI: 10.1200/op.20.00298] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
PURPOSE: ASCO recommends early integration of palliative care in treating patients diagnosed with metastatic lung cancer. Our study sought to examine utilization of timely specialty palliative care (SPC) and its association with survival and cost outcomes in patients diagnosed with metastatic non–small-cell lung cancer (NSCLC). METHODS: The 2001-2015 SEER-Medicare data were used to determine the baseline characteristics and outcomes of 79,253 patients with metastatic NSCLC. The predictors of early SPC use were examined using logistic regression. Mean and adjusted total and SPC-related costs were calculated using generalized linear regression. We used a Cox regression model to determine the survival outcomes by SPC service settings. All statistical tests were two sided. RESULTS: The time from cancer diagnosis to the first SPC use decreased significantly, from 13.7 weeks in 2001 to 8.3 weeks in 2015 (P < .001). SPC use was associated with lower health care costs compared with no SPC use, from −$3,180 in 2011 (P < .001) to −$1,285 in 2015 (P = .059). Outpatient SPC use was associated with improved survival compared with SPC received in other settings (hazard ratio, 0.83; 95% CI, 0.79 to 0.88; P < .001). CONCLUSION: Patients diagnosed with metastatic NSCLC now have more timely SPC service utilization, which was demonstrated to be a cost-saving treatment. Strategies to improve outpatient palliative care use might be associated with longer survival in patients with metastatic NSCLC.
|
249
|
Marra DE, Miller AH, Li Q, Yang X, Smith GE, Wu Y, Bian J, Maraganore DM. Utilizing electronic medical record data to predict onset of Alzheimer’s disease and related dementias. Alzheimers Dement 2020. [DOI: 10.1002/alz.041233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
250
|
Yang X, He X, Zhang H, Ma Y, Bian J, Wu Y. Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models. JMIR Med Inform 2020; 8:e19735. [PMID: 33226350 PMCID: PMC7721552 DOI: 10.2196/19735] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2020] [Revised: 10/19/2020] [Accepted: 10/26/2020] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS. OBJECTIVE This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS. METHODS In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models. RESULTS Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010). CONCLUSIONS This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization.
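The evaluation metric named in this abstract, the Pearson correlation between predicted and gold-standard similarity scores, can be computed as in the sketch below. The scores are invented for illustration and are not taken from the n2c2 dataset; `pearson` is a hypothetical helper written in plain Python rather than the challenge's scoring script.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold and predicted similarity scores for five sentence pairs.
gold = [0.5, 1.0, 2.5, 4.0, 5.0]
predicted = [0.8, 1.2, 2.0, 3.9, 4.6]
r = pearson(gold, predicted)
```

Because Pearson correlation is invariant to linear rescaling of either sequence, a model whose scores are systematically shifted or scaled relative to the gold annotations can still score highly.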
|