1
Yang Y, Jayaraj S, Ludmir E, Roberts K. Text Classification of Cancer Clinical Trial Eligibility Criteria. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:1304-1313. [PMID: 38222417 PMCID: PMC10785908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
Automatic identification of clinical trials for which a patient is eligible is complicated by the fact that trial eligibility criteria are stated in natural language. A potential solution to this problem is to employ text classification methods for common types of eligibility criteria. In this study, we focus on seven common exclusion criteria in cancer trials: prior malignancy, human immunodeficiency virus, hepatitis B, hepatitis C, psychiatric illness, drug/substance abuse, and autoimmune illness. Our dataset consists of 764 phase III cancer trials with these exclusions annotated at the trial level. We experiment with common transformer models as well as a new pre-trained clinical trial BERT model. Our results demonstrate the feasibility of automatically classifying common exclusion criteria. Additionally, we demonstrate the value of a pre-trained language model specifically for clinical trials, which yields the highest average performance across all criteria.
Affiliation(s)
- Yumeng Yang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Soumya Jayaraj
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ethan Ludmir
  - Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Kirk Roberts
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
2
Su Q, Cheng G, Huang J. A review of research on eligibility criteria for clinical trials. Clin Exp Med 2023; 23:1867-1879. [PMID: 36602707 PMCID: PMC9815064 DOI: 10.1007/s10238-022-00975-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/21/2022] [Accepted: 12/06/2022] [Indexed: 01/06/2023]
Abstract
The purpose of this paper is to systematically organize and analyze cutting-edge research on the eligibility criteria of clinical trials. Eligibility criteria are important prerequisites for the success of clinical trials and directly affect their final results. Inappropriate eligibility criteria lead to insufficient recruitment, which is an important reason for the eventual failure of many clinical trials. We investigated the research status of eligibility criteria for clinical trials on academic platforms such as arXiv and NIH, and we classified and organized all the papers we found so that readers can understand the frontier research in this field. Eligibility criteria are the most important part of a clinical trial study. The ultimate goal of research in this field is to formulate more scientific and reasonable eligibility criteria and speed up the clinical trial process. Global research on the eligibility criteria of clinical trials falls into four main areas: natural language processing, patient pre-screening, standard evaluation, and clinical trial query. Compared with the past, researchers are now using new technologies to study eligibility criteria from a new perspective (big data). Complex disease concepts, the choice of a suitable dataset, and proving the validity and scientific soundness of the research results are challenges faced by researchers (especially computer-related researchers). Future research will focus on the selection and improvement of artificial intelligence algorithms related to clinical trials and on related practical applications such as databases, knowledge graphs, and dictionaries.
Affiliation(s)
- Qianmin Su
  - Department of Computer Science, School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, No. 333 Longteng Road, Shanghai, 201620, China
- Gaoyi Cheng
  - Department of Computer Science, School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, No. 333 Longteng Road, Shanghai, 201620, China
- Jihan Huang
  - Center for Drug Clinical Research, Shanghai University of Traditional Chinese Medicine, Shanghai, 201203, China
3
Comprehensive Review and Future Research Directions on Dynamic Faceted Search. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11178113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
In modern society, the increasing number of web search operations on various search engines has become ubiquitous due to the significant number of results presented to the users and the incompetent result-ranking mechanism in some domains, such as medical, law, and academia. As a result, the user is overwhelmed with a large number of misranked or uncategorized search results. One of the most promising technologies to reduce the number of results and provide desirable information to the users is dynamic faceted filters. Therefore, this paper extensively reviews related research articles published in IEEE Xplore, Web of Science, and the ACM digital library. As a result, a total of 170 related research papers were considered and organized into five categories. The main contribution of this paper is to provide a detailed analysis of the faceted search’s fundamental attributes, as well as to demonstrate the motivation from the usage, concerns, challenges, and recommendations to enhance the use of the faceted approach among web search service providers.
4
Dhayne H, Kilany R, Haque R, Taher Y. EMR2vec: Bridging the gap between patient data and clinical trial. COMPUTERS & INDUSTRIAL ENGINEERING 2021; 156:107236. [PMID: 33746344 PMCID: PMC7959675 DOI: 10.1016/j.cie.2021.107236] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/03/2020] [Revised: 02/17/2021] [Accepted: 03/08/2021] [Indexed: 06/12/2023]
Abstract
The human suffering from diseases caused by life-threatening viruses such as SARS, Ebola, and COVID-19 motivated many of us to study and discover the best means to harness the potential of data integration to assist clinical researchers to curb these viruses. Integrating patient data with clinical trial data is enormously promising as it provides a comprehensive knowledge base that accelerates the clinical research response-ability to tackle emerging infectious disease outbreaks. This work introduces EMR2vec, a platform that customises advanced NLP, machine learning and semantic web techniques to link potential patients to suitable clinical trials. Linking these two different but complementary datasets allows clinicians and researchers to compare patients to clinical research opportunities or to automatically select patients for personalized clinical care. The platform derives a 'bag of medical terms' (BoMT) from eligibility criteria by normalizing extracted entities through the SNOMED-CT ontology. Using BoMT, an ontological reasoning method is proposed to represent EMR and clinical trials in a vector space model. The platform presents a matching process that reduces vector dimensionality using a neural network, then applies orthogonality projection to measure the similarity between vectors. Finally, the proposed EMR2vec platform is evaluated with an extendable prototype based on big data tools.
Affiliation(s)
- Rima Kilany
  - Saint Joseph University, Mar Roukos, Beirut, Lebanon
- Rafiqul Haque
  - Intelligencia, 66 Avenue des Champs Elysees, Paris, France
- Yehia Taher
  - David lab, 45 Avenue des Etats Unis, Versailles, France
5
Liu C, Yuan C, Butler AM, Carvajal RD, Li ZR, Ta CN, Weng C. DQueST: dynamic questionnaire for search of clinical trials. J Am Med Inform Assoc 2021; 26:1333-1343. [PMID: 31390010 PMCID: PMC6798577 DOI: 10.1093/jamia/ocz121] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/18/2019] [Revised: 05/31/2019] [Accepted: 06/18/2019] [Indexed: 11/27/2022]
Abstract
Objective Information overload remains a challenge for patients seeking clinical trials. We present a novel system (DQueST) that reduces information overload for trial seekers using dynamic questionnaires. Materials and Methods DQueST first performs information extraction and criteria library curation. DQueST transforms criteria narratives in the ClinicalTrials.gov repository into a structured format, normalizes clinical entities using standard concepts, clusters related criteria, and stores the resulting curated library. DQueST then implements a real-time dynamic question generation algorithm. During user interaction, the initial search is similar to a standard search engine, and then DQueST performs real-time dynamic question generation to select criteria from the library 1 at a time by maximizing its relevance score that reflects its ability to rule out ineligible trials. DQueST dynamically updates the remaining trial set by removing ineligible trials based on user responses to corresponding questions. The process iterates until users decide to stop and begin manually reviewing the remaining trials. Results In simulation experiments initiated by 10 diseases, DQueST reduced information overload by filtering out 60%–80% of initial trials after 50 questions. Reviewing the generated questions against previous answers, on average, 79.7% of the questions were relevant to the queried conditions. By examining the eligibility of random samples of trials ruled out by DQueST, we estimate the accuracy of the filtering procedure is 63.7%. In a study using 5 mock patient profiles, DQueST on average retrieved trials with a 1.465 times higher density of eligible trials than an existing search engine. In a patient-centered usability evaluation, patients found DQueST useful, easy to use, and returning relevant results. 
Conclusion DQueST contributes a novel framework for transforming free-text eligibility criteria to questions and filtering out clinical trials based on user answers to questions dynamically. It promises to augment keyword-based methods to improve clinical trial search.
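The question-selection loop described in this abstract can be illustrated with a minimal sketch. It is not DQueST's implementation: the abstract does not define the relevance score, so the sketch assumes a simple proxy (the number of remaining trials that list a criterion as an exclusion), and the names `next_question`, `apply_answer`, and the toy `library` are hypothetical.

```python
def next_question(remaining, library, asked):
    """Pick the unasked criterion that could rule out the most remaining trials.

    Assumption: the score is the count of remaining trials listing the
    criterion as an exclusion. The real relevance score is not given here.
    """
    scores = {
        criterion: len(trials & remaining)
        for criterion, trials in library.items()
        if criterion not in asked
    }
    # Criteria that cannot remove any remaining trial are not worth asking.
    scores = {c: s for c, s in scores.items() if s > 0}
    return max(scores, key=scores.get) if scores else None


def apply_answer(remaining, library, criterion, user_meets_exclusion):
    """Remove trials that exclude on `criterion` when the user reports it."""
    if user_meets_exclusion:
        return remaining - library[criterion]
    return remaining


# Hypothetical criteria library: exclusion criterion -> trials that use it.
library = {
    "pregnancy": {"T1", "T2", "T3"},
    "HIV": {"T2"},
    "prior chemotherapy": {"T4"},
}
remaining = {"T1", "T2", "T3", "T4"}

first = next_question(remaining, library, asked=set())
remaining = apply_answer(remaining, library, first, user_meets_exclusion=True)
second = next_question(remaining, library, asked={first})
```

In this toy run the first question removes three of the four trials, and the next question is chosen only among criteria that still apply to the remaining set.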
Affiliation(s)
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Chi Yuan
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Alex M Butler
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
  - Department of Medicine, Columbia University, New York, New York, USA
- Ziran Ryan Li
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Casey N Ta
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
6
Atal I, Zeitoun JD, Névéol A, Ravaud P, Porcher R, Trinquart L. Automatic classification of registered clinical trials towards the Global Burden of Diseases taxonomy of diseases and injuries. BMC Bioinformatics 2016; 17:392. [PMID: 27659604 PMCID: PMC5034670 DOI: 10.1186/s12859-016-1247-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 02/24/2016] [Accepted: 09/08/2016] [Indexed: 01/14/2023]
Abstract
BACKGROUND Clinical trial registries may allow for producing a global mapping of health research. However, health conditions are not described with standardized taxonomies in registries. Previous work analyzed clinical trial registries to improve the retrieval of relevant clinical trials for patients. However, no previous work has classified clinical trials across diseases using a standardized taxonomy allowing a comparison between global health research and global burden across diseases. We developed a knowledge-based classifier of health conditions studied in registered clinical trials towards categories of diseases and injuries from the Global Burden of Diseases (GBD) 2010 study. The classifier relies on the UMLS® knowledge source (Unified Medical Language System®) and on heuristic algorithms for parsing data. It maps trial records to a 28-class grouping of the GBD categories by automatically extracting UMLS concepts from text fields and by projecting concepts between medical terminologies. The classifier allows deriving pathways between the clinical trial record and candidate GBD categories using natural language processing and links between knowledge sources, and selects the relevant GBD classification based on rules of prioritization across the pathways found. We compared automatic and manual classifications for an external test set of 2,763 trials. We automatically classified 109,603 interventional trials registered before February 2014 at WHO ICTRP. RESULTS In the external test set, the classifier identified the exact GBD categories for 78 % of the trials. It had very good performance for most of the 28 categories, especially "Neoplasms" (sensitivity 97.4 %, specificity 97.5 %). The sensitivity was moderate for trials not relevant to any GBD category (53 %) and low for trials of injuries (16 %). 
For the 109,603 trials registered at WHO ICTRP, the classifier did not assign any GBD category to 20.5 % of trials while the most common GBD categories were "Neoplasms" (22.8 %) and "Diabetes" (8.9 %). CONCLUSIONS We developed and validated a knowledge-based classifier allowing for automatically identifying the diseases studied in registered trials by using the taxonomy from the GBD 2010 study. This tool is freely available to the research community and can be used for large-scale public health studies.
Affiliation(s)
- Ignacio Atal
  - Centre d’Épidémiologie Clinique, Hôpital Hôtel-Dieu, Paris, France
  - INSERM U1153, Paris, France
  - Université Paris Descartes, Paris, France
- Jean-David Zeitoun
  - Centre d’Épidémiologie Clinique, Hôpital Hôtel-Dieu, Paris, France
  - INSERM U1153, Paris, France
  - Université Paris Descartes, Paris, France
- Aurélie Névéol
  - LIMSI, CNRS UPR 3251, Université Paris-Saclay, Orsay, France
- Philippe Ravaud
  - Centre d’Épidémiologie Clinique, Hôpital Hôtel-Dieu, Paris, France
  - INSERM U1153, Paris, France
  - Université Paris Descartes, Paris, France
  - Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY USA
- Raphaël Porcher
  - Centre d’Épidémiologie Clinique, Hôpital Hôtel-Dieu, Paris, France
  - INSERM U1153, Paris, France
  - Université Paris Descartes, Paris, France
- Ludovic Trinquart
  - Centre d’Épidémiologie Clinique, Hôpital Hôtel-Dieu, Paris, France
  - INSERM U1153, Paris, France
  - Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY USA
7
Hao T, Liu H, Weng C. Valx: A System for Extracting and Structuring Numeric Lab Test Comparison Statements from Text. Methods Inf Med 2016; 55:266-75. [PMID: 26940748 DOI: 10.3414/me15-01-0112] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 08/26/2015] [Accepted: 02/07/2016] [Indexed: 01/08/2023]
Abstract
OBJECTIVES To develop an automated method for extracting and structuring numeric lab test comparison statements from text and evaluate the method using clinical trial eligibility criteria text. METHODS Leveraging semantic knowledge from the Unified Medical Language System (UMLS) and domain knowledge acquired from the Internet, Valx takes seven steps to extract and normalize numeric lab test expressions: 1) text preprocessing, 2) numeric, unit, and comparison operator extraction, 3) variable identification using hybrid knowledge, 4) variable - numeric association, 5) context-based association filtering, 6) measurement unit normalization, and 7) heuristic rule-based comparison statements verification. Our reference standard was the consensus-based annotation among three raters for all comparison statements for two variables, i.e., HbA1c and glucose, identified from all of Type 1 and Type 2 diabetes trials in ClinicalTrials.gov. RESULTS The precision, recall, and F-measure for structuring HbA1c comparison statements were 99.6%, 98.1%, 98.8% for Type 1 diabetes trials, and 98.8%, 96.9%, 97.8% for Type 2 diabetes trials, respectively. The precision, recall, and F-measure for structuring glucose comparison statements were 97.3%, 94.8%, 96.1% for Type 1 diabetes trials, and 92.3%, 92.3%, 92.3% for Type 2 diabetes trials, respectively. CONCLUSIONS Valx is effective at extracting and structuring free-text lab test comparison statements in clinical trial summaries. Future studies are warranted to test its generalizability beyond eligibility criteria text. The open-source Valx enables its further evaluation and continued improvement among the collaborative scientific community.
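To make the idea of a structured comparison statement concrete, here is a small regular-expression sketch covering only steps 2 and 4 of the pipeline (operator, number, and unit extraction, then association with a variable). It is not Valx's code: the pattern, the two-variable vocabulary, the unit list, and the function name `extract_comparisons` are assumptions made for illustration.

```python
import re

# Assumed, deliberately tiny vocabulary: two variables and three units.
PATTERN = re.compile(
    r"(?P<var>HbA1c|glucose)\s*"
    r"(?P<op><=|>=|<|>|=)\s*"
    r"(?P<num>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>%|mg/dL|mmol/L)?",
    re.IGNORECASE,
)


def extract_comparisons(text):
    """Return (variable, operator, value, unit) tuples found in `text`."""
    return [
        (m["var"], m["op"], float(m["num"]), m["unit"])
        for m in PATTERN.finditer(text)
    ]


criteria = "Inclusion: HbA1c >= 7.0 % and fasting glucose < 270 mg/dL"
statements = extract_comparisons(criteria)
```

A pattern this simple misses ranges, written-out operators ("greater than"), and unit conversion, which is where the knowledge-based steps listed in the abstract would come in.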
Affiliation(s)
- Chunhua Weng
  - Chunhua Weng, Ph.D., Department of Biomedical Informatics, Columbia University, New York City, 622 W 168th Street, PH-20, New York, NY 10032, USA
8
Effective Filtering of Query Results on Updated User Behavioral Profiles in Web Mining. ScientificWorldJournal 2015. [PMID: 26221626 PMCID: PMC4478364 DOI: 10.1155/2015/829126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
The Web holds a tremendous volume of information and retrieves results for user queries. Despite the rapid growth of web page recommendation, results retrieved with data mining techniques have not achieved a high filtering rate because the relationships between user profiles and queries were not analyzed extensively. At the same time, existing user-profile-based prediction in web data mining is not exhaustive in producing personalized results. To improve the query result rate as user behavior changes over time, the Hamilton Filtered Regime Switching User Query Probability (HFRS-UQP) framework is proposed. The HFRS-UQP framework is split into two processes, filtering and switching. The data mining based filtering uses the Hamilton Filtering framework to filter user results based on personalized information from profiles that are updated automatically through the search engine. The maximized result is fetched, that is, filtered with respect to user behavior profiles. The switching step filters the updated profiles accurately using regime switching. Updating a profile changes (i.e., switches) the regime, and the HFRS-UQP framework identifies the second- and higher-order associations of query results on the updated profiles. Experiments are conducted on factors such as personalized information search retrieval rate, filtering efficiency, and precision ratio.
9
A Semantic Web-based System for Mining Genetic Mutations in Cancer Clinical Trials. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2015; 2015:142-6. [PMID: 26306257 PMCID: PMC4525254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Textual eligibility criteria in clinical trial protocols contain important information about potential clinically relevant pharmacogenomic events. Manual curation for harvesting this evidence is intractable as it is error-prone and time-consuming. In this paper, we develop and evaluate a Semantic Web-based system that captures and manages mutation evidence and related contextual information from cancer clinical trials. The system has 2 main components: an NLP-based annotator and a Semantic Web ontology-based annotation manager. We evaluated the performance of the annotator in terms of precision and recall. We demonstrated the usefulness of the system by conducting case studies in retrieving relevant clinical trials using a collection of mutations identified from TCGA Leukemia patients and the Atlas of Genetics and Cytogenetics in Oncology and Haematology. In conclusion, our system using Semantic Web technologies provides an effective framework for extraction, annotation, standardization and management of genetic mutations in cancer clinical trials.
10
Miotto R, Weng C. Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials. J Am Med Inform Assoc 2015; 22:e141-50. [PMID: 25769682 PMCID: PMC4428438 DOI: 10.1093/jamia/ocu050] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 09/15/2014] [Accepted: 12/16/2014] [Indexed: 11/12/2022]
Abstract
Objective To develop a cost-effective, case-based reasoning framework for clinical research eligibility screening by only reusing the electronic health records (EHRs) of minimal enrolled participants to represent the target patient for each trial under consideration. Materials and Methods The EHR data—specifically diagnosis, medications, laboratory results, and clinical notes—of known clinical trial participants were aggregated to profile the “target patient” for a trial, which was used to discover new eligible patients for that trial. The EHR data of unseen patients were matched to this “target patient” to determine their relevance to the trial; the higher the relevance, the more likely the patient was eligible. Relevance scores were a weighted linear combination of cosine similarities computed over individual EHR data types. For evaluation, we identified 262 participants of 13 diversified clinical trials conducted at Columbia University as our gold standard. We ran a 2-fold cross validation with half of the participants used for training and the other half used for testing along with other 30 000 patients selected at random from our clinical database. We performed binary classification and ranking experiments. Results The overall area under the ROC curve for classification was 0.95, enabling the highlight of eligible patients with good precision. Ranking showed satisfactory results especially at the top of the recommended list, with each trial having at least one eligible patient in the top five positions. Conclusions This relevance-based method can potentially be used to identify eligible patients for clinical trials by processing patient EHR data alone without parsing free-text eligibility criteria, and shows promise of efficient “case-based reasoning” modeled only on minimal trial participants.
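The relevance score in this abstract, a weighted linear combination of cosine similarities computed per EHR data type, can be sketched as follows. This is an illustration under assumptions, not the authors' code: the sparse dict representation, the weights, the toy codes, and the names `cosine` and `relevance` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(patient, target, weights):
    """Weighted sum of per-data-type cosine similarities."""
    return sum(
        weight * cosine(patient.get(dtype, {}), target.get(dtype, {}))
        for dtype, weight in weights.items()
    )


# Hypothetical "target patient" profile aggregated from enrolled participants.
target = {
    "diagnosis": {"E11": 2.0, "I10": 1.0},
    "medication": {"metformin": 1.0},
}
# Hypothetical unseen patient and assumed equal weights.
patient = {
    "diagnosis": {"E11": 1.0},
    "medication": {"metformin": 3.0},
}
weights = {"diagnosis": 0.5, "medication": 0.5}

score = relevance(patient, target, weights)
```

Ranking unseen patients by `score` in descending order would give the kind of recommended list the abstract evaluates, with higher scores suggesting a closer match to the trial's target patient.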
Affiliation(s)
- Chunhua Weng
  - Department of Biomedical Informatics The Irving Institute for Clinical and Translational Research, Columbia University, New York, NY 10032, USA
11
Visual aggregate analysis of eligibility features of clinical trials. J Biomed Inform 2015; 54:241-55. [PMID: 25615940 DOI: 10.1016/j.jbi.2015.01.005] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/29/2014] [Revised: 11/23/2014] [Accepted: 01/12/2015] [Indexed: 12/20/2022]
Abstract
OBJECTIVE To develop a method for profiling the collective populations targeted for recruitment by multiple clinical studies addressing the same medical condition using one eligibility feature each time. METHODS Using a previously published database COMPACT as the backend, we designed a scalable method for visual aggregate analysis of clinical trial eligibility features. This method consists of four modules for eligibility feature frequency analysis, query builder, distribution analysis, and visualization, respectively. This method is capable of analyzing (1) frequently used qualitative and quantitative features for recruiting subjects for a selected medical condition, (2) distribution of study enrollment on consecutive value points or value intervals of each quantitative feature, and (3) distribution of studies on the boundary values, permissible value ranges, and value range widths of each feature. All analysis results were visualized using Google Charts API. Five recruited potential users assessed the usefulness of this method for identifying common patterns in any selected eligibility feature for clinical trial participant selection. RESULTS We implemented this method as a Web-based analytical system called VITTA (Visual Analysis Tool of Clinical Study Target Populations). We illustrated the functionality of VITTA using two sample queries involving quantitative features BMI and HbA1c for conditions "hypertension" and "Type 2 diabetes", respectively. The recruited potential users rated the user-perceived usefulness of VITTA with an average score of 86.4/100. CONCLUSIONS We contributed a novel aggregate analysis method to enable the interrogation of common patterns in quantitative eligibility criteria and the collective target populations of multiple related clinical studies. A larger-scale study is warranted to formally assess the usefulness of VITTA among clinical investigators and sponsors in various therapeutic areas.
12
Hao T, Weng C. Adaptive semantic tag mining from heterogeneous clinical research texts. Methods Inf Med 2014; 54:164-70. [PMID: 25327613 DOI: 10.3414/me13-01-0130] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 11/29/2013] [Accepted: 09/15/2014] [Indexed: 01/27/2023]
Abstract
OBJECTIVES To develop an adaptive approach to mine frequent semantic tags (FSTs) from heterogeneous clinical research texts. METHODS We develop a "plug-n-play" framework that integrates replaceable unsupervised kernel algorithms with formatting, functional, and utility wrappers for FST mining. Temporal information identification and semantic equivalence detection were two example functional wrappers. We first compared this approach's recall and efficiency for mining FSTs from ClinicalTrials.gov to that of a recently published tag-mining algorithm. Then we assessed this approach's adaptability to two other types of clinical research texts: clinical data requests and clinical trial protocols, by comparing the prevalence trends of FSTs across three texts. RESULTS Our approach increased the average recall and speed by 12.8% and 47.02% respectively upon the baseline when mining FSTs from ClinicalTrials.gov, and maintained an overlap in relevant FSTs with the baseline ranging between 76.9% and 100% for varying FST frequency thresholds. The FSTs saturated when the data size reached 200 documents. Consistent trends in the prevalence of FST were observed across the three texts as the data size or frequency threshold changed. CONCLUSIONS This paper contributes an adaptive tag-mining framework that is scalable and adaptable without sacrificing its recall. This component-based architectural design can be potentially generalizable to improve the adaptability of other clinical text mining methods.
Affiliation(s)
- C Weng
  - Chunhua Weng, Ph.D., Associate Professor, Department of Biomedical Informatics, Columbia University, 622 W 168 Street, PH-20, New York, NY, 10032, USA
13
Weng C, Li Y, Ryan P, Zhang Y, Liu F, Gao J, Bigger JT, Hripcsak G. A distribution-based method for assessing the differences between clinical trial target populations and patient populations in electronic health records. Appl Clin Inform 2014; 5:463-79. [PMID: 25024761 DOI: 10.4338/aci-2013-12-ra-0105] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 12/18/2013] [Accepted: 04/09/2014] [Indexed: 12/19/2022]
Abstract
OBJECTIVE To improve the transparency of clinical trial generalizability and to illustrate the method using Type 2 diabetes as an example. METHODS Our data included 1,761 diabetes clinical trials and the electronic health records (EHR) of 26,120 patients with Type 2 diabetes who visited Columbia University Medical Center of New-York Presbyterian Hospital. The two populations were compared using the Generalizability Index for Study Traits (GIST) on the earliest diagnosis age and the mean hemoglobin A1c (HbA1c) values. RESULTS Greater than 70% of Type 2 diabetes studies allow patients with HbA1c measures between 7 and 10.5, but less than 40% of studies allow HbA1c<7 and fewer than 45% of studies allow HbA1c>10.5. In the real-world population, only 38% of patients had HbA1c between 7 and 10.5, with 12% having values above the range and 52% having HbA1c<7. The GIST for HbA1c was 0.51. Most studies adopted broad age value ranges, with the most common restrictions excluding patients >80 or <18 years. Most of the real-world population fell within this range, but 2% of patients were <18 at time of first diagnosis and 8% were >80. The GIST for age was 0.75. CONCLUSIONS We contribute a scalable method to profile and compare aggregated clinical trial target populations with EHR patient populations. We demonstrate that Type 2 diabetes studies are more generalizable with regard to age than they are with regard to HbA1c. We found that the generalizability of age increased from Phase 1 to Phase 3 while the generalizability of HbA1c decreased during those same phases. This method can generalize to other medical conditions and other continuous or binary variables. We envision the potential use of EHR data for examining the generalizability of clinical trials and for defining population-representative clinical trial eligibility criteria.
Affiliation(s)
- C Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032
- Y Li
  - Department of Computer Science, City College of New York, New York, NY 10031
- P Ryan
  - Janssen Research and Development, Titusville, New Jersey, 08560
  - Observational Health Data Sciences and Informatics, New York, NY, 10032
- Y Zhang
  - Department of Biostatistics, Columbia University, New York, NY 10032
- F Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032
- J Gao
  - Business School, Columbia University, New York, NY 10025
- J T Bigger
  - Department of Medicine, Columbia University, New York, NY 10032
- G Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032
14
Jiang SY, Weng C. Cross-system evaluation of clinical trial search engines. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2014; 2014:223-9. [PMID: 25954590 PMCID: PMC4419768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
Clinical trials are fundamental to the advancement of medicine but constantly face recruitment difficulties. Various clinical trial search engines have been designed to help health consumers identify trials for which they may be eligible. Unfortunately, knowledge of the usefulness and usability of their designs remains scarce. In this study, we used mixed methods, including time-motion analysis, think-aloud protocol, and survey, to evaluate five popular clinical trial search engines with 11 users. Differences in user preferences and time spent on each system were observed and correlated with user characteristics. In general, searching for applicable trials using these systems is a cognitively demanding task. Our results show that user perceptions of these systems are multifactorial. The survey indicated that eTACTS was the generally preferred system, but this finding did not persist across all of the mixed methods. This study confirms the value of mixed methods for a comprehensive system evaluation. Future system designers must be aware that different user groups expect different functionalities.
Affiliation(s)
- Silis Y Jiang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
| |
|
15
|
Hao T, Rusanov A, Boland MR, Weng C. Clustering clinical trials with similar eligibility criteria features. J Biomed Inform 2014; 52:112-20. [PMID: 24496068 DOI: 10.1016/j.jbi.2014.01.009] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Received: 07/13/2013] [Revised: 01/15/2014] [Accepted: 01/24/2014] [Indexed: 10/25/2022]
Abstract
OBJECTIVES To automatically identify and cluster clinical trials with similar eligibility features. METHODS Using the public repository ClinicalTrials.gov as the data source, we extracted semantic features from the eligibility criteria text of all clinical trials and constructed a trial-feature matrix. We calculated the pairwise similarities for all clinical trials based on their eligibility features. For all trials, by selecting one trial as the center each time, we identified trials whose similarities to the central trial were greater than or equal to a predefined threshold and constructed center-based clusters. Then we identified unique trial sets with distinctive trial membership compositions from center-based clusters by disregarding their structural information. RESULTS From the 145,745 clinical trials on ClinicalTrials.gov, we extracted 5,508,491 semantic features. Of these, 459,936 were unique and 160,951 were shared by at least one pair of trials. Crowdsourcing the cluster evaluation using Amazon Mechanical Turk (MTurk), we identified the optimal similarity threshold, 0.9. Using this threshold, we generated 8,806 center-based clusters. Evaluation of a sample of the clusters by MTurk resulted in a mean score of 4.331±0.796 on a scale of 1-5 (5 indicating "strongly agree that the trials in the cluster are similar"). CONCLUSIONS We contribute an automated approach to clustering clinical trials with similar eligibility features. This approach can be potentially useful for investigating knowledge reuse patterns in clinical trial eligibility criteria designs and for improving clinical trial recruitment. We also contribute an effective crowdsourcing method for evaluating informatics interventions.
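The center-based clustering step described in the abstract above can be illustrated with a minimal sketch. The abstract does not name the similarity measure, so Jaccard similarity over feature sets is assumed here purely for illustration; the function names, the toy trials, and the 0.75 threshold are likewise hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (an assumed measure, not the paper's)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def center_based_clusters(trial_features, threshold):
    """Take each trial in turn as the center and collect every trial whose
    similarity to that center is greater than or equal to the threshold."""
    clusters = {}
    for center, center_feats in trial_features.items():
        clusters[center] = frozenset(
            other
            for other, feats in trial_features.items()
            if jaccard(center_feats, feats) >= threshold
        )
    return clusters


def unique_trial_sets(clusters):
    """Drop the center information and keep only distinct membership sets."""
    return set(clusters.values())


# Toy data: hypothetical trial IDs mapped to hypothetical eligibility features.
trials = {
    "T1": {"hiv", "pregnancy", "age>=18"},
    "T2": {"hiv", "pregnancy", "age>=18"},
    "T3": {"hiv", "pregnancy", "age>=18", "hepatitis_b"},
    "T4": {"diabetes", "hba1c>7"},
}

clusters = center_based_clusters(trials, threshold=0.75)
unique_sets = unique_trial_sets(clusters)
```

In this toy example the clusters centered on T1, T2, and T3 have the same members, so they collapse into a single unique trial set once the center is disregarded, which mirrors the deduplication step the abstract describes.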
Affiliation(s)
- Tianyong Hao
- Department of Biomedical Informatics, Columbia University, New York, NY, United States
| | - Alexander Rusanov
- Department of Anesthesiology, Columbia University, New York, NY, United States
| | - Mary Regina Boland
- Department of Biomedical Informatics, Columbia University, New York, NY, United States
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, United States.
| |
|