1
|
Ferdowsi S, Knafou J, Borissov N, Vicente Alvarez D, Mishra R, Amini P, Teodoro D. Deep learning-based risk prediction for interventional clinical trials based on protocol design: A retrospective study. PATTERNS (NEW YORK, N.Y.) 2023; 4:100689. [PMID: 36960445 PMCID: PMC10028430 DOI: 10.1016/j.patter.2023.100689] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/29/2022] [Revised: 11/07/2022] [Accepted: 01/16/2023] [Indexed: 02/12/2023]
Abstract
The success rate of clinical trials (CTs) is low, and the protocol design itself is considered a major risk factor. We aimed to investigate the use of deep learning methods to predict the risk of CTs based on their protocols. Considering protocol changes and final trial status, we proposed a retrospective risk assignment method to label CTs as low, medium, or high risk. Transformer and graph neural networks were then designed and combined in an ensemble model to infer these ternary risk categories. The ensemble model achieved robust performance (area under the receiver operating characteristic curve [AUROC] of 0.8453 [95% confidence interval: 0.8409-0.8495]), similar to the individual architectures but significantly outperforming a baseline based on bag-of-words features (0.7548 [0.7493-0.7603] AUROC). We demonstrate the potential of deep learning in predicting the risk of CTs from their protocols, paving the way for customized risk mitigation strategies during protocol design.
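For readers unfamiliar with the metric reported above, the following is a minimal Python sketch of an AUROC with a percentile bootstrap confidence interval. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how the interval was computed or how the three risk classes were aggregated, so the sketch handles a binary label only, and the function names `auroc` and `bootstrap_ci` are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (assumed method, binary case)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUROC is undefined, draw again
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Calling `bootstrap_ci(labels, scores)` on held-out predictions returns a `(lower, upper)` pair; a narrow interval such as the one reported in the abstract typically reflects a large evaluation set.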
Affiliation(s)
- Sohrab Ferdowsi
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, Geneva, Switzerland
- Julien Knafou
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, Geneva, Switzerland
- Nikolay Borissov
- Clinical Trials Unit, University of Bern, Bern, Switzerland
- Risklick AG, Bern, Switzerland
- David Vicente Alvarez
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, Geneva, Switzerland
- Rahul Mishra
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Poorya Amini
- Clinical Trials Unit, University of Bern, Bern, Switzerland
- Risklick AG, Bern, Switzerland
- Douglas Teodoro
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, Geneva, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Corresponding author
|
2
|
Improving clinical trial design using interpretable machine learning based prediction of early trial termination. Sci Rep 2023; 13:121. [PMID: 36599880 PMCID: PMC9813129 DOI: 10.1038/s41598-023-27416-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/26/2022] [Accepted: 01/02/2023] [Indexed: 01/06/2023] Open
Abstract
This study proposes using a machine learning pipeline to optimise clinical trial design. The goal is to predict the early termination probability of clinical trials using machine learning modelling, and to understand the feature contributions driving early termination. This will inform further suggestions to the study protocol to reduce the risk of wasted resources. A dataset containing 420,268 clinical trial records and 24 fields was extracted from the ClinicalTrials.gov registry. In addition to study characteristics features, 12,864 eligibility criteria search features are used, generated using a public annotated eligibility criteria dataset, CHIA. Furthermore, disease categorization features are used, allowing a study to belong to more than one category specified by ClinicalTrials.gov. Ensemble models including random forest and extreme gradient boosting classifiers were used to train and evaluate predictive performance. We achieved a receiver operating characteristic area under the curve (ROC AUC) score of 0.80 and a balanced accuracy of 0.70 on the test set using gradient boosting classification. We used Shapley Additive Explanations to interpret the termination predictions and flag feature contributions. The proposed pipeline will lead to an optimised clinical trial design and consequently help potentially life-saving treatments reach patients faster.
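Balanced accuracy, reported above alongside ROC AUC, is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when terminated trials are a minority class. The sketch below is a minimal illustration in Python; the function name and the toy labels are hypothetical and are not taken from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy example: 2 terminated trials (1) among 10; the model predicts "completed" for all.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case plain accuracy is high because the majority class dominates, while balanced accuracy exposes that no terminated trial was detected.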
|
3
|
Eysenbach G, Šuster S, Baldwin T, Verspoor K. Predicting Publication of Clinical Trials Using Structured and Unstructured Data: Model Development and Validation Study. J Med Internet Res 2022; 24:e38859. [PMID: 36563029 PMCID: PMC9823568 DOI: 10.2196/38859] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/19/2022] [Revised: 10/14/2022] [Accepted: 11/16/2022] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on the predictive analysis of a trial's publishability given an individual (planned) clinical trial description. OBJECTIVE We aimed to conduct a study that combined structured and unstructured features relevant to publication status in a single predictive approach. Established natural language processing techniques as well as recent pretrained language models enabled us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We were particularly interested in whether and which textual features could improve the classification accuracy for publication outcomes. METHODS In this study, we used metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a database of academic journal articles) to build a data set of clinical trials (N=76,950) that contained the description of a registered trial and its publication outcome (27,702/76,950, 36% published and 49,248/76,950, 64% unpublished). This is the largest data set of its kind, which we released as part of this work. The publication outcome in the data set was identified from MEDLINE based on clinical trial identifiers. 
We carried out a descriptive analysis and predicted the publication outcome using 2 approaches: a neural network with a large domain-specific language model and a random forest classifier using a weighted bag-of-words representation of text. RESULTS First, our analysis of the newly created data set corroborates several findings from the existing literature regarding attributes associated with a higher publication rate. Second, a crucial observation from our predictive modeling was that the addition of textual features (eg, eligibility criteria) offers consistent improvements over using only structured data (F1-score=0.62-0.64 vs F1-score=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two. CONCLUSIONS Different factors affect the publication of a registered clinical trial. Our approach to predictive modeling combines heterogeneous features, both structured and unstructured. We show that methods from natural language processing can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task previously.
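The F1-scores quoted above combine precision and recall into a single number. A minimal Python sketch of the computation follows, assuming a binary published/unpublished label; the function name `f1_score` here is a local helper, not a library call, and the abstract does not state which averaging variant the authors used.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if 2 * tp + fp + fn == 0:
        return 0.0  # no positives in truth or prediction; define F1 as 0
    # Equivalent to 2 * precision * recall / (precision + recall)
    return 2 * tp / (2 * tp + fp + fn)
```

Because F1 ignores true negatives, it is sensitive to which class is treated as positive, so differences as small as 0.61 versus 0.62-0.64 are best read together with the class balance (36% published in this data set).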
Affiliation(s)
- Simon Šuster
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
- Timothy Baldwin
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia
|
4
|
Abstract
To address the low accuracy, long running time, and large memory consumption of traditional data mining methods, a local discrete text data mining method for high-dimensional data spaces is proposed. First, through data preparation and preprocessing, we obtain the minimum data divergence and maximize the data dimension to meet the demand for data in high-dimensional space; second, we use the information gain method to mine the preprocessed discrete text data and establish an objective function that obtains the highest information gain; finally, the objective functions established in data preparation, preprocessing, and mining are combined into a multi-objective optimization problem to realize local discrete text data mining. The simulation results show that our method reduces running time, improves the accuracy of data mining, and consumes less memory, indicating that the multi-objective optimization method can address several problems at once and improve the data mining effect.
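The information gain criterion mentioned above can be illustrated with the standard entropy-based definition. The Python sketch below is an assumption about the general technique only; the abstract does not give the paper's objective function, and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that separates the classes perfectly recovers the full label entropy, while a feature independent of the labels yields a gain of zero.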
|
5
|
A Novel Text Classification Technique Using Improved Particle Swarm Optimization: A Case Study of Arabic Language. FUTURE INTERNET 2022. [DOI: 10.3390/fi14070194] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/06/2023] Open
Abstract
We propose a novel text classification model, which aims to improve the performance of Arabic text classification using machine learning techniques. One effective solution in Arabic text classification is to find a suitable feature selection method with an optimal number of features alongside the classifier. Although several text classification methods have been proposed for the Arabic language using different techniques, such as feature selection methods, an ensemble of classifiers, and discriminative features, choosing the optimal method becomes an NP-hard problem considering the huge search space. Therefore, we propose a method, called Optimal Configuration Determination for Arabic text Classification (OCATC), which uses the Particle Swarm Optimization (PSO) algorithm to find the optimal solution (configuration) in this space. The proposed OCATC method extracts the features from the textual documents and converts them into a numerical vector using the Term Frequency-Inverse Document Frequency (TF–IDF) approach. Finally, PSO selects the best configuration: a classifier, a feature selection method, and an optimal number of features. Extensive experiments were carried out to evaluate the performance of the OCATC method using six datasets, including five publicly available datasets and our proposed dataset. The results obtained demonstrate the superiority of OCATC over individual classifiers and other state-of-the-art methods.
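The TF–IDF conversion step described above can be sketched in a few lines of Python. This is one common variant (relative term frequency multiplied by the natural log of inverse document frequency); the abstract does not specify which weighting or smoothing OCATC uses, and the function name and toy tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for tokenized documents.

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus of pre-tokenized documents (illustrative only).
vectors = tf_idf([["trial", "risk"], ["trial", "design"]])
```

Terms that appear in every document receive a weight of zero in this variant, which is why a feature selection step, like the one PSO searches over, usually operates on the remaining discriminative terms.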
|
6
|
Alharbey R, Kim JI, Daud A, Song M, Alshdadi AA, Hayat MK. Indexing important drugs from medical literature. Scientometrics 2022. [DOI: 10.1007/s11192-022-04340-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
7
|
On Graph Construction for Classification of Clinical Trials Protocols Using Graph Neural Networks. Artif Intell Med 2022. [DOI: 10.1007/978-3-031-09342-5_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
8
|
Kim B, Jang YJ, Cho HR, Kim SY, Jeong JE, Shim MK, Kim MG. Predicting completion of clinical trials in pregnant women: Cox proportional hazard and neural network models. Clin Transl Sci 2021; 15:691-699. [PMID: 34735737 PMCID: PMC8932703 DOI: 10.1111/cts.13187] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/12/2021] [Revised: 09/25/2021] [Accepted: 10/21/2021] [Indexed: 12/01/2022] Open
Abstract
This study aimed to develop a model for predicting the completion of clinical trials involving pregnant women using the Cox proportional hazard model and a neural network model (DeepSurv), and to compare the predictive performance of both methods. From ClinicalTrials.gov, we collected data on 819 interventional clinical trials performed on pregnant women between 2009 and 2018 that used at least one drug as an intervention. The Cox proportional hazard model and DeepSurv were used to develop models that predict clinical trial completion. The concordance index (C‐index) was used to evaluate the predictive performance. The Cox proportional hazard model revealed that a sample size of n ≥ 329 (hazard ratio [HR] = 0.53), very high human development index (HDI) country (HR = 0.28), abortion (HR = 3.30), labor (HR = 2.16), and iron deficiency anemia (HR = 2.29) were significantly related to the probability of clinical trial completion (all p values < 0.01). The C‐index values for the model development dataset and test dataset were 0.72 and 0.73, respectively. The DeepSurv model consisted of one hidden layer with 16 nodes. DeepSurv showed a C‐index comparable to that of the Cox proportional hazard model: the C‐index values for the training dataset and test dataset were 0.76 and 0.72, respectively. Furthermore, a nomogram that calculates the probability of clinical trial completion at 1 year, 3 years, and 5 years was developed. Both the Cox proportional hazard model and DeepSurv yielded sufficient predictive performance. We hope that this study will contribute to the execution of future clinical trials in pregnant women.
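The concordance index used above to compare the two models can be illustrated with Harrell's pairwise definition. The Python sketch below is a minimal illustration under assumptions: the abstract does not say which C-index estimator or tie handling the authors used, the function name is hypothetical, and "event" here stands for whatever outcome the survival model tracks (trial completion in this study).

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i experienced the event and did so
    before subject j's observed time. The pair is concordant when the model gave
    subject i the higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported test-set values of 0.72-0.73 indicate moderate discrimination.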
Affiliation(s)
- Bomee Kim
- Graduate School of Clinical Biohealth, Ewha Womans University, Seoul, Korea
- Yun Ji Jang
- College of Pharmacy, CHA University, Pocheon, Korea
- Hae Ram Cho
- College of Pharmacy, CHA University, Pocheon, Korea
- So Yeon Kim
- College of Pharmacy, CHA University, Pocheon, Korea
- Ji Eun Jeong
- College of Pharmacy, CHA University, Pocheon, Korea
- Myeong Gyu Kim
- College of Pharmacy, Ewha Womans University, Seoul, Korea
- Graduate School of Pharmaceutical Sciences, Ewha Womans University, Seoul, Korea
|
9
|
Natural language processing in law: Prediction of outcomes in the higher courts of Turkey. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102684] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 10/20/2022]
|
10
|
Understanding and predicting COVID-19 clinical trial completion vs. cessation. PLoS One 2021; 16:e0253789. [PMID: 34252108 PMCID: PMC8274906 DOI: 10.1371/journal.pone.0253789] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/06/2021] [Accepted: 06/12/2021] [Indexed: 11/19/2022] Open
Abstract
As of March 30, 2021, over 5,193 COVID-19 clinical trials have been registered through ClinicalTrials.gov. Among them, 191 trials were terminated, suspended, or withdrawn (indicating the cessation of the study). On the other hand, 909 trials have been completed (indicating the completion of the study). In this study, we propose to study the underlying factors of COVID-19 trial completion vs. cessation, and design predictive models to accurately predict whether a COVID-19 trial may complete or cease in the future. We collect 4,441 COVID-19 trials from ClinicalTrials.gov to build a testbed, and design four types of features to characterize clinical trial administration, eligibility, study information, criteria, drug types, and study keywords, as well as embedding features commonly used in state-of-the-art machine learning. Our study shows that drug features and study keywords are the most informative features, but all four types of features are essential for accurate trial prediction. By using predictive models, our approach achieves more than 0.87 AUC (Area Under the Curve) score and 0.81 balanced accuracy in predicting COVID-19 clinical trial completion vs. cessation. Our research shows that computational methods can deliver effective features to understand the difference between completed and ceased COVID-19 trials. In addition, such models can also predict COVID-19 trial status with satisfactory accuracy, and help stakeholders better plan trials and minimize costs.
|
11
|
Amara A, Hadj Taieb MA, Ben Aouicha M. Multilingual topic modeling for tracking COVID-19 trends based on Facebook data analysis. APPL INTELL 2021; 51:3052-3073. [PMID: 34764585 PMCID: PMC7881346 DOI: 10.1007/s10489-020-02033-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Accepted: 10/21/2020] [Indexed: 11/04/2022]
Abstract
Social data have shown an important role in the tracking, monitoring, and risk management of disasters. Indeed, several works have focused on the benefits of social data analysis for healthcare practices and the curing domain. Similarly, these data are now exploited for tracking the COVID-19 pandemic, but the majority of works use Twitter as the source. In this paper, we choose to exploit Facebook, which is rarely used, for tracking the evolution of COVID-19 related trends. A multilingual dataset covering 7 languages (English (EN), Arabic (AR), Spanish (ES), Italian (IT), German (DE), French (FR) and Japanese (JP)) is extracted from Facebook public posts. The proposal is an analytics process including a data gathering step, pre-processing, LDA-based topic modeling, and a presentation module using a graph structure. The analysis covers January 1, 2020 to May 15, 2020, divided cumulatively into three periods: January-February, March-April, and up to May 15. The results showed that the extracted topics correspond to the chronological development of what has been circulated around the pandemic and the measures that have been taken, according to the various languages under discussion representing several countries.
Affiliation(s)
- Amina Amara
- Multimedia, InfoRmation systems and Advanced Computing Laboratory, University of Sfax, Sfax, Tunisia
|
12
|
Predictive modeling of clinical trial terminations using feature engineering and embedding learning. Sci Rep 2021; 11:3446. [PMID: 33568706 PMCID: PMC7876037 DOI: 10.1038/s41598-021-82840-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/25/2020] [Accepted: 01/25/2021] [Indexed: 11/16/2022] Open
Abstract
In this study, we propose to use machine learning to understand terminated clinical trials. Our goal is to answer two fundamental questions: (1) what are the common factors/markers associated with terminated clinical trials? and (2) how can we accurately predict whether a clinical trial may be terminated or not? The answer to the first question provides effective ways to understand the characteristics of terminated trials so that stakeholders can better plan their trials; the answer to the second question can directly estimate the chance of success of a clinical trial in order to minimize costs. By using 311,260 trials to build a testbed with 68,999 samples, we use feature engineering to create 640 features, reflecting clinical trial administration, eligibility, study information, criteria, etc. Using feature ranking, a handful of features, such as trial eligibility, trial inclusion/exclusion criteria, and sponsor types, are found to be related to clinical trial termination. By using sampling and ensemble learning, we achieve over 67% balanced accuracy and over 0.73 AUC (Area Under the Curve) in predicting clinical trial termination, indicating that machine learning can help achieve satisfactory prediction results for clinical trial studies.
|
13
|
Geletta S, Follett L, Laugerman M. Latent Dirichlet Allocation in predicting clinical trial terminations. BMC Med Inform Decis Mak 2019; 19:242. [PMID: 31775737 PMCID: PMC6882341 DOI: 10.1186/s12911-019-0973-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/01/2019] [Accepted: 11/08/2019] [Indexed: 11/10/2022] Open
Abstract
Background This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns from within research narrative documents to distinguish studies that complete successfully from the ones that terminate. Recent research findings have reported that at least 10% of all studies that are funded by major research funding agencies terminate without yielding useful results. Since it is well-known that scientific studies that receive funding from major funding agencies are carefully planned, and rigorously vetted through the peer-review process, it was somewhat daunting to us that study-terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for study terminations are not well understood. We therefore aimed to address that knowledge gap, by seeking to identify the factors that contribute to study failures. Method We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics), and unstructured data (the narrative description of the studies). We applied natural language processing techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with trials that are terminated and trials that are completed. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 “topics” with corresponding sets of probabilities, which we then used to predict study-termination by utilizing random forest modeling. We fit two distinct models – one using only structured data as predictors and another model with both structured data and the 25 text topics derived from the unstructured data. Results In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure. 
The results also demonstrate that the combined modeling approach yields robust predictive probabilities in terms of both sensitivity and specificity, relative to a model that utilizes the structured data alone. Conclusions Our study demonstrated that the use of topic modeling using LDA significantly raises the utility of unstructured data in better predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
Affiliation(s)
- Simon Geletta
- Department of Public Health, Des Moines University, 169 Ryan Hall, 3200 Grand Ave, Des Moines, IA, USA
- Lendie Follett
- Department of Data Analytics, College of Business and Public Administration, Drake University, Des Moines, IA, USA
- Marcia Laugerman
- Department of Data Analytics, College of Business and Public Administration, Drake University, Des Moines, IA, USA
|