1
Lee SJ, Kim JH. Applying Sequential Pattern Mining to Investigate the Temporal Relationships between Commonly Occurring Internal Medicine Diseases and Intervals for the Risk of Concurrent Disease in Canine Patients. Animals (Basel) 2023; 13:3359. [PMID: 37958114 PMCID: PMC10647901 DOI: 10.3390/ani13213359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 09/29/2023] [Accepted: 10/27/2023] [Indexed: 11/15/2023]
Abstract
Sequential pattern mining (SPM) is a data mining technique for identifying common association rules across multiple sequential datasets and patterns in ordered events. In this study, we aimed to identify the relationships between commonly occurring internal medicine diseases in canine patients. We obtained medical records of dogs referred to the Konkuk University Veterinary Medicine Teaching Hospital. The data used for SPM included comorbidities and intervals between the diagnoses of internal medicine diseases. Additionally, we used logistic regression to estimate the 3-year risk of developing an additional disease after the initial diagnosis of a commonly occurring veterinary internal medicine disease. We identified 547 canine patients diagnosed with ≥ 1 internal medicine disease. The SPM-based analysis assessed comorbidities and intervals for each of the five most common internal medicine diseases: hyperadrenocorticism, myxomatous mitral valve disease, canine atopic dermatitis, chronic kidney disease, and chronic pancreatitis. The highest association rule values were 3.01%, 6.02%, 3.9%, 4.1%, and 4.84%, and the shortest intervals were 1.64, 13.14, 5.37, 17.02, and 1.7 days, respectively. This study proposes that SPM is an effective technique for identifying common associations and temporal relationships between internal medicine diseases, and that it can be used to assess the probability of additional admission due to the development of a subsequent disease in canine patients. The results of this study will help veterinarians suggest appropriate preventive measures or other medical treatments for canine patients with medical conditions that have not yet been diagnosed but are likely to develop in the short term.
Affiliation(s)
- Suk-Jun Lee
  - Department of Business Management, Kwangwoon University, 536 Nuri-Hall, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea
- Jung-Hyun Kim
  - Department of Veterinary Internal Medicine, College of Veterinary Medicine, Konkuk University, #120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
2
Zhu Y, Venugopalan J, Zhang Z, Chanani NK, Maher KO, Wang MD. Domain Adaptation Using Convolutional Autoencoder and Gradient Boosting for Adverse Events Prediction in the Intensive Care Unit. Front Artif Intell 2022; 5:640926. [PMID: 35481281 PMCID: PMC9036368 DOI: 10.3389/frai.2022.640926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2020] [Accepted: 02/23/2022] [Indexed: 11/13/2022]
Abstract
More than 5 million patients are admitted annually to intensive care units (ICUs) in the United States. The leading causes of mortality are cardiovascular failures, multi-organ failures, and sepsis. Data-driven techniques have been used in the analysis of patient data to predict adverse events, such as ICU mortality and ICU readmission. These models often make use of temporal or static features from a single ICU database to make predictions on subsequent adverse events. To explore the potential of domain adaptation, we propose a method of data analysis using gradient boosting and a convolutional autoencoder (CAE) to predict significant adverse events in the ICU, such as ICU mortality and ICU readmission. We demonstrate our results from a retrospective data analysis using patient records from a publicly available database called Multi-parameter Intelligent Monitoring in Intensive Care-II (MIMIC-II) and a local database from Children's Healthcare of Atlanta (CHOA). We demonstrate that after adopting novel data imputation on patient ICU data, gradient boosting is effective in both the mortality prediction task and the ICU readmission prediction task. In addition, we use gradient boosting to identify top-ranking temporal and non-temporal features in both prediction tasks. We discuss the relationship between these features and the specific prediction task. Lastly, we indicate that CAE might not be effective in feature extraction on one dataset, but domain adaptation with CAE feature extraction across two datasets shows promising results.
Affiliation(s)
- Yuanda Zhu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States
- Janani Venugopalan
  - Biomedical Engineering Department, Georgia Institute of Technology, Emory University, Atlanta, GA, United States
- Zhenyu Zhang
  - Biomedical Engineering Department, Georgia Institute of Technology, Atlanta, GA, United States
  - Department of Biomedical Engineering, Peking University, Beijing, China
- Kevin O. Maher
  - Pediatrics Department, Emory University, Atlanta, GA, United States
- May D. Wang
  - Biomedical Engineering Department, Georgia Institute of Technology, Emory University, Atlanta, GA, United States
  - *Correspondence: May D. Wang
3
Brown KA, Sarkar IN, Chen ES. Mental Health Comorbidity Analysis in Pediatric Patients with Autism Spectrum Disorder Using Rhode Island Medical Claims Data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:263-272. [PMID: 33936398 PMCID: PMC8075466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Identification of comorbidity subgroups linked with Autism Spectrum Disorder (ASD) could provide promising insight into this disorder. This study sought to use the Rhode Island All-Payer Claims Database to examine mental health conditions linked to ASD. Medical claims data for patients with ASD and one or more mental health conditions were analyzed using descriptive statistics, association rule mining (ARM), and sequential pattern mining (SPM). The results indicated that patients with ASD have a higher proportion of mental health diagnoses than the general pediatric population. ARM and SPM methods identified patterns of comorbidities commonly seen among ASD patients. Based on the observed patterns and temporal sequences, suicidal ideation, mood disorders, anxiety, and conduct disorders may need focused attention prospectively. Understanding more about groupings of ASD patients and their comorbidity burden can help bridge gaps in knowledge and make strides toward improved outcomes for patients with ASD.
Affiliation(s)
- Katherine A Brown
  - Brown Center for Biomedical Informatics, Brown University, Providence RI
- Indra Neil Sarkar
  - Brown Center for Biomedical Informatics, Brown University, Providence RI
  - Rhode Island Quality Institute, Providence RI
- Elizabeth S Chen
  - Brown Center for Biomedical Informatics, Brown University, Providence RI
4
Morid MA, Sheng ORL, Kawamoto K, Abdelrahman S. Learning hidden patterns from patient multivariate time series data using convolutional neural networks: A case study of healthcare cost prediction. J Biomed Inform 2020; 111:103565. [DOI: 10.1016/j.jbi.2020.103565] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/20/2019] [Revised: 08/27/2020] [Accepted: 09/07/2020] [Indexed: 01/20/2023]
5
6
Kocheturov A, Momcilovic P, Bihorac A, Pardalos PM. Extended vertical lists for temporal pattern mining from multivariate time series. EXPERT SYSTEMS 2019; 36:e12448. [PMID: 33162636 PMCID: PMC7646935 DOI: 10.1111/exsy.12448] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/21/2018] [Accepted: 05/10/2019] [Indexed: 06/11/2023]
Abstract
In this paper, the problem of mining complex temporal patterns in the context of multivariate time series is considered. A new method called the Fast Temporal Pattern Mining with Extended Vertical Lists is introduced. The method is based on an extension of the level-wise property, which requires a more complex pattern to start at positions within a record where all of the subpatterns of the pattern start. The approach is built around a novel data structure called the Extended Vertical List that tracks positions of the first state of the pattern inside records and links them to appropriate positions of a specific subpattern of the pattern called the prefix. Extensive computational results indicate that the new method performs significantly faster than the previous version of the algorithm for Temporal Pattern Mining; however, the increase in speed comes at the expense of increased memory usage.
Affiliation(s)
- Anton Kocheturov
  - Center for Applied Optimization, Industrial and Systems Engineering, University of Florida, Gainesville, Florida
- Petar Momcilovic
  - Industrial and Systems Engineering, University of Florida, Gainesville, Florida
- Azra Bihorac
  - Division of Nephrology, Hypertension, and Renal Transplantation, University of Florida, Gainesville, Florida
- Panos M. Pardalos
  - Center for Applied Optimization, Industrial and Systems Engineering, University of Florida, Gainesville, Florida
7
R D, P R. An Optimized HCC Recurrence Prediction Using APO Algorithm Multiple Time Series Clinical Liver Cancer Dataset. J Med Syst 2019; 43:193. [PMID: 31115780 DOI: 10.1007/s10916-019-1265-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/12/2019] [Accepted: 03/28/2019] [Indexed: 12/16/2022]
Abstract
Classifying recurrence versus non-recurrence of hepatocellular carcinoma (HCC) after radiofrequency ablation therapy is a critical task. A multiple-time-series clinical liver cancer dataset is collected from different datasets and time intervals. A merging algorithm is used to merge all attributes collected from different sources over multiple time periods. To preserve the originality of the information, statistical measures of each attribute are calculated and treated as additional attributes for accurate prediction. However, the merged dataset is unbalanced: the number of samples from the HCC recurrence class is much smaller than the number from the HCC non-recurrence class. The feature weighting scheme selects optimal features, and the classifier parameters are obtained sequentially over multiple iterations, which increases computation time. In this paper, an efficient sampling approach using Inverse Random Under Sampling (IRUS) is proposed to overcome the class imbalance issue. IRUS undersamples the majority class, creating a number of distinct partitions with a boundary separating minority and majority class samples. Additionally, an optimization approach using the Artificial Plant Optimization (APO) algorithm is proposed to select optimal features and classifier parameters and thereby improve the effectiveness and efficiency of classification. The optimization approach reduces the number of iterations and the computation time for feature selection and parameter selection for the classifiers that classify recurrence and non-recurrence of HCC. Patients with HCC and without HCC are classified based on the optimal features and parameters by Support Vector Machine (SVM) and Random Forest (RF) classifiers. Finally, experiments are conducted to demonstrate the effectiveness of the proposed method over the existing method in terms of accuracy, specificity, sensitivity, and balanced accuracy.
Affiliation(s)
- Divya R
  - Research Scholar, PG and Research Department of Computer Science, Govt Arts College (Autonomous), Coimbatore, Tamil Nadu, India
- Radha P
  - Assistant Professor, PG and Research Department of Computer Science, Govt Arts College (Autonomous), Coimbatore, Tamil Nadu, India
8
Levine ME, Albers DJ, Hripcsak G. Methodological variations in lagged regression for detecting physiologic drug effects in EHR data. J Biomed Inform 2018; 86:149-159. [PMID: 30172760 DOI: 10.1016/j.jbi.2018.08.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 01/03/2018] [Revised: 07/20/2018] [Accepted: 08/29/2018] [Indexed: 12/22/2022]
Abstract
We studied how lagged linear regression can be used to detect the physiologic effects of drugs from data in the electronic health record (EHR). We systematically examined the effect of methodological variations ((i) time series construction, (ii) temporal parameterization, (iii) intra-subject normalization, (iv) differencing (lagged rates of change achieved by taking differences between consecutive measurements), (v) explanatory variables, and (vi) regression models) on performance of lagged linear methods in this context. We generated two gold standards (one knowledge-base derived, one expert-curated) for expected pairwise relationships between 7 drugs and 4 labs, and evaluated how the 64 unique combinations of methodological perturbations reproduce the gold standards. Our 28 cohorts included patients in the Columbia University Medical Center/NewYork-Presbyterian Hospital clinical database, and ranged from 2820 to 79,514 patients with between 8 and 209 average time points per patient. The most accurate methods achieved AUROC of 0.794 for knowledge-base derived gold standard (95%CI [0.741, 0.847]) and 0.705 for expert-curated gold standard (95% CI [0.629, 0.781]). We observed a mean AUROC of 0.633 (95%CI [0.610, 0.657], expert-curated gold standard) across all methods that re-parameterize time according to sequence and use either a joint autoregressive model with time-series differencing or an independent lag model without differencing. The complement of this set of methods achieved a mean AUROC close to 0.5, indicating the importance of these choices. We conclude that time-series analysis of EHR data will likely rely on some of the beneficial pre-processing and modeling methodologies identified, and will certainly benefit from continued careful analysis of methodological perturbations. 
This study found that methodological variations, such as pre-processing and representations, have a large effect on results, exposing the importance of thoroughly evaluating these components when comparing machine-learning methods.
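Two of the variations named in this abstract, differencing and lagged linear regression, can be illustrated with a minimal sketch. It assumes NumPy, indexes time by measurement sequence rather than clock time, and uses hypothetical function names (`difference`, `lagged_design`, `fit_lagged_regression`); it is an illustration of the general technique, not the authors' pipeline.

```python
import numpy as np

def difference(values):
    """Differences between consecutive measurements (the 'differencing' step)."""
    values = np.asarray(values, dtype=float)
    return values[1:] - values[:-1]

def lagged_design(x, y, n_lags):
    """Build rows [x[t-1], ..., x[t-n_lags]] paired with the target y[t].

    Time is indexed by position in the series (sequence time), which is an
    assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rows = [x[t - n_lags:t][::-1] for t in range(n_lags, len(y))]
    return np.array(rows), y[n_lags:]

def fit_lagged_regression(x, y, n_lags):
    """Ordinary least squares fit; returns [intercept, lag1, ..., lag_n]."""
    X, target = lagged_design(x, y, n_lags)
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

# Synthetic example: the "lab" y responds to the "drug" x one step later.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = np.empty(50)
y[0] = 0.0
y[1:] = 2.0 * x[:-1] + 1.0
coef = fit_lagged_regression(x, y, n_lags=2)
```

On this synthetic series the fitted lag-1 coefficient recovers the simulated effect and the lag-2 coefficient is near zero. Applying `difference` to both series before fitting would give the differenced variant.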
Affiliation(s)
- Matthew E Levine
  - Department of Biomedical Informatics, Columbia University Medical Center, 622 W. 168th Street, Presbyterian Building 20th Floor, New York, NY 10032, United States; Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States
- David J Albers
  - Department of Biomedical Informatics, Columbia University Medical Center, 622 W. 168th Street, Presbyterian Building 20th Floor, New York, NY 10032, United States; Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Medical Center, 622 W. 168th Street, Presbyterian Building 20th Floor, New York, NY 10032, United States; Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States; NewYork-Presbyterian Hospital, 622 W. 168th Street, New York, NY 10032, United States
9
Hoffman RA, Venugopalan J, Qu L, Wu H, Wang MD. Improving Validity of Cause of Death on Death Certificates. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2018; 2018:178-183. [PMID: 32558825 PMCID: PMC7302107 DOI: 10.1145/3233547.3233581] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/28/2022]
Abstract
Accurate reporting of causes of death on death certificates is essential for national health-protection institutions such as the Centers for Disease Control and Prevention (CDC) to formulate appropriate disease control, prevention, and emergency response. In this study, we utilize knowledge from publicly available expert-formulated rules for the cause of death to determine the extent to which death certificates in national mortality data are discordant with the expert knowledge base. We also report the most commonly occurring invalid causal pairs that physicians enter on death certificates. We use sequence rule mining to find the patterns that are most frequent on death certificates and compare them with the rules from the expert knowledge base. Based on our results, 20.1% of the common patterns derived from entries on death certificates were discordant. The most probable causes of these discordant or invalid rules are missing steps and non-specific ICD-10 codes on the death certificates.
Affiliation(s)
- Ryan A Hoffman
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Janani Venugopalan
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Li Qu
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Hang Wu
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- May D Wang
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
10
Hoffman RA, Wu H, Venugopalan J, Braun P, Wang MD. Intelligent Mortality Reporting With FHIR. IEEE J Biomed Health Inform 2018; 22:1583-1588. [PMID: 29993991 DOI: 10.1109/jbhi.2017.2780891] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/07/2022]
Abstract
One pressing need in the area of public health is timely, accurate, and complete reporting of deaths and the diseases or conditions leading up to them. Fast Healthcare Interoperability Resources (FHIR) is a new HL7 interoperability standard for electronic health records, while Sustainable Medical Applications and Reusable Technologies (SMART)-on-FHIR enables third-party app development that can work "out of the box." This paper demonstrates the feasibility of developing SMART-on-FHIR applications that enable medical professionals to perform timely and accurate death reporting within multiple US state jurisdictions. We explored how the information on a standard certificate of death can be mapped to resources defined in the FHIR standard Draft Standard for Trial Use Version 2 and common profiles. We also demonstrated analytics for potentially improving the accuracy and completeness of mortality reporting data.
11
Hripcsak G, Albers DJ. High-fidelity phenotyping: richness and freedom from bias. J Am Med Inform Assoc 2018; 25:289-294. [PMID: 29040596 PMCID: PMC7282504 DOI: 10.1093/jamia/ocx110] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 06/16/2017] [Revised: 08/07/2017] [Accepted: 09/06/2017] [Indexed: 01/14/2023]
Abstract
Electronic health record phenotyping is the use of raw electronic health record data to assert characterizations about patients. Researchers have been doing it since the beginning of biomedical informatics, under different names. Phenotyping will benefit from an increasing focus on fidelity, both in the sense of increasing richness, such as measured levels, degree or severity, timing, probability, or conceptual relationships, and in the sense of reducing bias. Research agendas should shift from merely improving binary assignment to studying and improving richer representations. The field is actively researching new temporal directions and abstract representations, including deep learning. The field would benefit from research in nonlinear dynamics, in combining mechanistic models with empirical data, including data assimilation, and in topology. The health care process produces substantial bias, and studying that bias explicitly rather than treating it as merely another source of noise would facilitate addressing it.
Affiliation(s)
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, NY, USA
- David J Albers
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, NY, USA
12
Hoffman RA, Wu H, Venugopalan J, Braun P, Wang MD. Intelligent Mortality Reporting with FHIR. ... IEEE-EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS. IEEE-EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS 2017; 2017:181-184. [PMID: 28804791 DOI: 10.1109/bhi.2017.7897235] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/02/2023]
Abstract
One pressing need in the area of public health is timely, accurate, and complete reporting of deaths and the conditions leading up to them. Fast Healthcare Interoperability Resources (FHIR) is a new HL7 interoperability standard for electronic health records (EHRs), while Sustainable Medical Applications and Reusable Technologies (SMART)-on-FHIR enables third-party app development that can work "out of the box". This research demonstrates the feasibility of developing SMART-on-FHIR applications that enable medical professionals to perform timely and accurate death reporting within multiple US jurisdictions. We explored how the information on a standard certificate of death can be mapped to resources defined in the FHIR standard (DSTU2). We also demonstrated analytics for potentially improving the accuracy and completeness of mortality reporting data.
Affiliation(s)
- Ryan A Hoffman
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332 USA
- Hang Wu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
- Janani Venugopalan
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332 USA
- Paula Braun
  - Centers for Disease Control and Prevention (CDC), Atlanta, GA 30329 USA
- May D Wang
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332 USA
13
Levine ME, Albers DJ, Hripcsak G. Comparing lagged linear correlation, lagged regression, Granger causality, and vector autoregression for uncovering associations in EHR data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:779-788. [PMID: 28269874 PMCID: PMC5333294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Time series analysis methods have been shown to reveal clinical and biological associations in data collected in the electronic health record. We wish to develop reliable high-throughput methods for identifying adverse drug effects that are easy to implement and produce readily interpretable results. To move toward this goal, we used univariate and multivariate lagged regression models to investigate associations between twenty pairs of drug orders and laboratory measurements. Multivariate lagged regression models exhibited higher sensitivity and specificity than univariate lagged regression in the 20 examples, and incorporating autoregressive terms for labs and drugs produced more robust signals in cases of known associations among the 20 example pairings. Moreover, including inpatient admission terms in the model attenuated the signals for some cases of unlikely associations, demonstrating how multivariate lagged regression models' explicit handling of context-based variables can provide a simple way to probe for health-care processes that confound analyses of EHR data.
Affiliation(s)
- Matthew E Levine
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- David J Albers
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
14
Balasubramanian A, Shamsuddin R, Prabhakaran B, Sawant A. Predictive modeling of respiratory tumor motion for real-time prediction of baseline shifts. Phys Med Biol 2017; 62:1791-1809. [PMID: 28075331 DOI: 10.1088/1361-6560/aa58c3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 12/25/2022]
Abstract
Baseline shifts in respiratory patterns can result in significant spatiotemporal changes in patient anatomy (compared to that captured during simulation), in turn causing geometric and dosimetric errors in the administration of thoracic and abdominal radiotherapy. We propose predictive modeling of the tumor motion trajectories for predicting a baseline shift ahead of its occurrence. The key idea is to use the features of the tumor motion trajectory over a 1 min window, and predict the occurrence of a baseline shift in the 5 s that immediately follow (lookahead window). In this study, we explored a preliminary trend-based analysis with multi-class annotations as well as a more focused binary classification analysis. In both analyses, a number of different inter-fraction and intra-fraction training strategies were studied, both offline and online, along with data sufficiency and skew compensation for class imbalances. The performance of the different training strategies was compared across multiple machine learning classification algorithms, including nearest neighbor, Naïve Bayes, linear discriminant, and ensemble AdaBoost. The prediction performance is evaluated using metrics such as accuracy, precision, recall, and the area under the curve (AUC) for the receiver operating characteristic curve. The key results of the trend-based analysis indicate that (i) intra-fraction training strategies achieve the highest prediction accuracies (90.5-91.4%); (ii) the predictive modeling yields the lowest accuracies (50-60%) when the training data does not include any information from the test patient; (iii) the prediction latencies are as low as a few hundred milliseconds, and thus conducive to real-time prediction. The binary classification performance is promising, indicated by high AUCs (0.96-0.98). It also confirms the utility of prior data from previous patients, as well as the necessity of training the classifier on some initial data from the new patient for reasonable prediction performance. The ability to predict a baseline shift with a sufficient look-ahead window will enable clinical systems or even human users to hold the treatment beam in such situations, thereby reducing the probability of serious geometric and dosimetric errors.
Affiliation(s)
- A Balasubramanian
  - Department of Computer Science, The University of Texas at Dallas, 800 W Campbell Road, Richardson, TX, United States of America
15
Batal I, Cooper G, Fradkin D, Harrison J, Moerchen F, Hauskrecht M. An Efficient Pattern Mining Approach for Event Detection in Multivariate Temporal Data. Knowl Inf Syst 2016; 46:115-150. [PMID: 26752800 PMCID: PMC4704806 DOI: 10.1007/s10115-015-0819-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 11/30/2013] [Revised: 08/31/2014] [Accepted: 12/06/2014] [Indexed: 11/27/2022]
Abstract
This work proposes a pattern mining approach to learn event detection models from complex multivariate temporal data, such as electronic health records. We present Recent Temporal Pattern mining, a novel approach for efficiently finding predictive patterns for event detection problems. This approach first converts the time series data into time-interval sequences of temporal abstractions. It then constructs more complex time-interval patterns backward in time using temporal operators. We also present the Minimal Predictive Recent Temporal Patterns framework for selecting a small set of predictive and non-spurious patterns. We apply our methods for predicting adverse medical events in real-world clinical data. The results demonstrate the benefits of our methods in learning accurate event detection models, which is a key step for developing intelligent patient monitoring and decision support systems.
Affiliation(s)
- Gregory Cooper
  - Department of Biomedical Informatics, University of Pittsburgh
- James Harrison
  - Department of Public Health Sciences, University of Virginia
16
Tseng YJ, Ping XO, Liang JD, Yang PM, Huang GT, Lai F. Multiple-Time-Series Clinical Data Processing for Classification With Merging Algorithm and Statistical Measures. IEEE J Biomed Health Inform 2015; 19:1036-43. [PMID: 25222960 DOI: 10.1109/jbhi.2014.2357719] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/07/2022]
Abstract
A description of patient conditions should consist of the changes in and combination of clinical measures. Traditional data-processing methods and classification algorithms might cause clinical information to disappear and reduce prediction performance. To improve the accuracy of clinical-outcome prediction by using multiple measurements, a new multiple-time-series data-processing algorithm with period merging is proposed. Clinical data from 83 hepatocellular carcinoma (HCC) patients were used in this research. Their clinical reports from a defined period were merged using the proposed merging algorithm, and statistical measures were also calculated. After data processing, a multiple measurements support vector machine (MMSVM) with radial basis function (RBF) kernels was used as a classification method to predict HCC recurrence. A multiple measurements random forest regression (MMRF) was also used as an additional evaluation/classification method. To evaluate the data-merging algorithm, the performance of prediction using processed multiple measurements was compared to prediction using single measurements. The results of recurrence prediction by MMSVM with RBF using multiple measurements and a period of 120 days (accuracy 0.771, balanced accuracy 0.603) were optimal, and their superiority to the results obtained using single measurements was statistically significant (accuracy 0.626, balanced accuracy 0.459, P < 0.01). In the case of MMRF, the prediction results obtained after applying the proposed merging algorithm were also better than the single-measurement results (P < 0.05). The results show that the performance of HCC-recurrence prediction was significantly improved when the proposed data-processing algorithm was used, and that multiple measurements could be of greater value than single measurements.
|
17
|
Hripcsak G, Albers DJ, Perotte A. Parameterizing time in electronic health record studies. J Am Med Inform Assoc 2015; 22:794-804. [PMID: 25725004 PMCID: PMC6169471 DOI: 10.1093/jamia/ocu051] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 05/15/2014] [Revised: 11/08/2014] [Accepted: 12/22/2014] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Fields like nonlinear physics offer methods for analyzing time series, but many methods require that the time series be stationary, that is, show no change in properties over time. OBJECTIVE Medicine is far from stationary, but the challenge may be ameliorated by reparameterizing time, because clinicians tend to measure patients more frequently when they are ill and are more likely to vary. METHODS We compared time parameterizations, measuring variability of rate of change and magnitude of change, and looking for homogeneity of bins of temporal separation between pairs of time points. We studied four common laboratory tests drawn from 25 years of electronic health records on 4 million patients. RESULTS We found that sequence time (that is, simply counting the number of measurements from some start) produced more stationary time series, better explained the variation in values, and had more homogeneous bins than either traditional clock time or a recently proposed intermediate parameterization. Sequence time produced more accurate predictions in a single Gaussian process model experiment. CONCLUSIONS Of the three parameterizations, sequence time appeared to produce the most stationary series, possibly because clinicians adjust their sampling to the acuity of the patient. Parameterizing by sequence time may be applicable to association and clustering experiments on electronic health record data. A limitation of this study is that laboratory data were derived from only one institution. Sequence time appears to be an important potential parameterization.
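The contrast between clock time and sequence time described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the laboratory values, and the dates are hypothetical, and the intermediate parameterization mentioned in the abstract is not modeled. The sketch only shows that the same series yields very different rates of change depending on whether the time axis is elapsed days or the measurement count.

```python
from datetime import datetime


def parameterize(timestamps):
    """Return (clock_time_days, sequence_time) for chronologically ordered measurement times."""
    start = timestamps[0]
    # Clock time: elapsed days since the first measurement.
    clock = [(t - start).total_seconds() / 86400.0 for t in timestamps]
    # Sequence time: simply the index of each measurement from the start.
    sequence = list(range(len(timestamps)))
    return clock, sequence


def rates_of_change(times, values):
    """Change in value per unit of the chosen time parameterization."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


# Hypothetical lab series: one sparse gap while stable, then daily sampling while ill.
stamps = [
    datetime(2020, 1, 1),
    datetime(2020, 7, 1),
    datetime(2020, 7, 2),
    datetime(2020, 7, 3),
]
values = [1.0, 1.2, 1.6, 2.0]

clock, sequence = parameterize(stamps)
clock_rates = rates_of_change(clock, values)
sequence_rates = rates_of_change(sequence, values)
```

In this made-up series the clock-time rates differ by more than two orders of magnitude between the sparse and dense periods, while the sequence-time rates stay within a factor of two, which is the kind of effect the abstract attributes to clinicians adjusting sampling frequency to patient acuity.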
Affiliation(s)
- George Hripcsak
- Department of Biomedical Informatics, Columbia University Medical Center, New York, USA; Medical Informatics Services, NewYork-Presbyterian Hospital, New York, USA
- David J Albers
- Department of Biomedical Informatics, Columbia University Medical Center, New York, USA
- Adler Perotte
- Department of Biomedical Informatics, Columbia University Medical Center, New York, USA
|
18
|
Rana S, Gupta S, Phung D, Venkatesh S. A predictive framework for modeling healthcare data with evolving clinical interventions. Stat Anal Data Min 2015. [DOI: 10.1002/sam.11262] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 11/05/2022]
Affiliation(s)
- Santu Rana
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong 3220, Australia
- Sunil Gupta
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong 3220, Australia
- Dinh Phung
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong 3220, Australia
- Svetha Venkatesh
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong 3220, Australia
|
19
|
|
20
|
Sacchi L, Dagliati A, Bellazzi R. Analyzing complex patients' temporal histories: new frontiers in temporal data mining. Methods Mol Biol 2015; 1246:89-105. [PMID: 25417081 DOI: 10.1007/978-1-4939-1985-7_6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 06/04/2023]
Abstract
In recent years, data coming from hospital information systems (HIS) and local healthcare organizations have started to be intensively used for research purposes. This rising amount of available data allows reconstructing the complete histories of the patients, which have a strong temporal component. This chapter introduces the major challenges faced by temporal data mining researchers in an era when huge quantities of complex clinical temporal data are becoming available. The analysis is focused on the peculiar features of this kind of data and describes the methodological and technological aspects that allow such a complex framework to be managed. The chapter shows how heterogeneous data can be processed to derive a homogeneous representation. Starting from this representation, it illustrates different techniques for jointly analyzing this kind of data. Finally, the technological strategies that allow creating a common data warehouse to gather data coming from different sources and with different formats are presented.
Affiliation(s)
- Lucia Sacchi
- Dipartimento di Ingegneria Industriale e dell'Informazione, Università degli Studi di Pavia, Via Ferrata 1, Pavia, 27100, Italy,
|
21
|
|
22
|
Wright AP, Wright AT, McCoy AB, Sittig DF. The use of sequential pattern mining to predict next prescribed medications. J Biomed Inform 2014; 53:73-80. [PMID: 25236952 DOI: 10.1016/j.jbi.2014.09.003] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Received: 03/13/2014] [Revised: 08/14/2014] [Accepted: 09/08/2014] [Indexed: 02/08/2023]
Abstract
BACKGROUND Therapy for certain medical conditions occurs in a stepwise fashion, where one medication is recommended as initial therapy and other medications follow. Sequential pattern mining is a data mining technique used to identify patterns of ordered events. OBJECTIVE To determine whether sequential pattern mining is effective for identifying temporal relationships between medications and accurately predicting the next medication likely to be prescribed for a patient. DESIGN We obtained claims data from Blue Cross Blue Shield of Texas for patients prescribed at least one diabetes medication between 2008 and 2011, and divided these into a training set (90% of patients) and test set (10% of patients). We applied the CSPADE algorithm to mine sequential patterns of diabetes medication prescriptions both at the drug class and generic drug level and ranked them by the support statistic. We then evaluated the accuracy of predictions made for which diabetes medication a patient was likely to be prescribed next. RESULTS We identified 161,497 patients who had been prescribed at least one diabetes medication. We were able to mine stepwise patterns of pharmacological therapy that were consistent with guidelines. Within three attempts, we were able to predict the medication prescribed for 90.0% of patients when making predictions by drug class, and for 64.1% when making predictions at the generic drug level. These results were stable under 10-fold cross validation, ranging from 89.1% to 90.5% at the drug class level and from 63.5% to 64.9% at the generic drug level. Using 1 or 2 items in the patient's medication history led to more accurate predictions than not using any history, but using the entire history was sometimes worse. CONCLUSION Sequential pattern mining is an effective technique to identify temporal relationships between medications and can be used to predict next steps in a patient's medication regimen. Accurate predictions can be made without using the patient's entire medication history.
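The idea of ranking sequential patterns by support and predicting the next medication "within three attempts" can be sketched in Python as follows. This is a deliberately simplified stand-in, not the CSPADE algorithm used in the study: it counts only two-item ordered patterns, uses only the last item of the history, and treats any later item as a possible "next" item. The function names and the training sequences are hypothetical and do not come from the study's claims data.

```python
from collections import Counter


def mine_pair_support(sequences):
    """For each ordered pair (a, b), count the sequences in which a is followed later by b."""
    support = Counter()
    for seq in sequences:
        seen_pairs = set()  # each sequence contributes at most once per pattern
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    seen_pairs.add((a, b))
        support.update(seen_pairs)
    return support


def predict_next(support, last_item, k=3):
    """Return up to k candidate next items, ranked by support of (last_item, candidate)."""
    candidates = [(count, b) for (a, b), count in support.items() if a == last_item]
    # Highest support first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda cb: (-cb[0], cb[1]))
    return [b for _, b in candidates[:k]]


# Hypothetical prescription histories at the drug-class level.
training = [
    ["metformin", "sulfonylurea", "insulin"],
    ["metformin", "sulfonylurea"],
    ["metformin", "dpp4_inhibitor", "insulin"],
    ["metformin", "insulin"],
]

support = mine_pair_support(training)
top3 = predict_next(support, "metformin")
```

A prediction would count as correct "within three attempts" if the medication actually prescribed next appears anywhere in `top3`. Using only the last item mirrors the abstract's observation that one or two history items can suffice, but the choice here is made for brevity rather than taken from the paper.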
Affiliation(s)
- Adam T Wright
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
- Allison B McCoy
- Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, United States
- Dean F Sittig
- The University of Texas School of Biomedical Informatics at Houston and the UT-Memorial Hermann Center for Healthcare Quality & Safety, Houston, TX, United States
|
23
|
Nguyen Q, Valizadegan H, Hauskrecht M. Learning classification models with soft-label information. J Am Med Inform Assoc 2014; 21:501-8. [PMID: 24259520 PMCID: PMC3994863 DOI: 10.1136/amiajnl-2013-001964] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 04/22/2013] [Revised: 10/24/2013] [Accepted: 11/01/2013] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE Learning of classification models in medicine often relies on data labeled by a human expert. Since labeling of clinical data may be time-consuming, finding ways of alleviating the labeling costs is critical for our ability to automatically learn such models. In this paper we propose a new machine learning approach that is able to learn improved binary classification models more efficiently by refining the binary class information in the training phase with soft labels that reflect how strongly the human expert feels about the original class labels. MATERIALS AND METHODS Two types of methods that can learn improved binary classification models from soft labels are proposed. The first relies on probabilistic/numeric labels, the other on ordinal categorical labels. We study and demonstrate the benefits of these methods for learning an alerting model for heparin induced thrombocytopenia. The experiments are conducted on the data of 377 patient instances labeled by three different human experts. The methods are compared using the area under the receiver operating characteristic curve (AUC) score. RESULTS Our AUC results show that the new approach is capable of learning classification models more efficiently compared to traditional learning methods. The improvement in AUC is most remarkable when the number of examples we learn from is small. CONCLUSIONS A new classification learning framework that lets us learn from auxiliary soft-label information provided by a human expert is a promising new direction for learning classification models from expert labels, reducing the time and cost needed to label data.
Affiliation(s)
- Quang Nguyen
- Computer Science Department, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
|
24
|
Liao V, Chen MS. Efficient mining gapped sequential patterns for motifs in biological sequences. BMC Systems Biology 2014; 7 Suppl 4:S7. [PMID: 24565366 PMCID: PMC3854651 DOI: 10.1186/1752-0509-7-s4-s7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
Abstract
Background Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Biological data mining has an impact in diverse biological fields, such as the discovery of co-occurring biosequences, which is important for biological data analyses. Sequential pattern mining approaches can discover motifs of all lengths in biological sequences. Nevertheless, traditional approaches to mining sequential patterns are inefficient on DNA and protein data, since the data have few distinct letters and lengthy sequences. Furthermore, gap constraints are important in computational biology since they cope with irrelevant regions, which are not conserved in the evolution of biological sequences. Results We devise an approach to efficiently mine sequential patterns (motifs) with gap constraints in biological sequences. The approach is the Depth-First Spelling algorithm for mining sequential patterns of biological sequences with Gap constraints (termed DFSG). Conclusions PrefixSpan is one of the most efficient methods among traditional approaches to mining sequential patterns, and it is the basis of GenPrefixSpan. GenPrefixSpan is an approach built on PrefixSpan with gap constraints, and therefore we compare DFSG with GenPrefixSpan. In the experimental results, DFSG mines biological sequences much faster than GenPrefixSpan.
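What a gap constraint means for a sequential pattern can be shown with a short Python sketch. It is neither DFSG nor GenPrefixSpan: it only checks, by plain backtracking, whether a given pattern occurs in a sequence when between `min_gap` and `max_gap` symbols may be skipped between consecutive matched symbols, and then counts how many sequences support the pattern. The function names, parameter names, and the toy DNA strings are assumptions made for illustration.

```python
def occurs_with_gaps(sequence, pattern, min_gap=0, max_gap=2):
    """True if the pattern's symbols appear in order in the sequence, with between
    min_gap and max_gap symbols skipped between consecutive matched symbols."""

    def extend(pos, idx):
        # pos: index in sequence of the previously matched symbol
        # idx: index in pattern of the next symbol to match
        if idx == len(pattern):
            return True
        lo = pos + 1 + min_gap
        hi = min(pos + 1 + max_gap, len(sequence) - 1)
        for j in range(lo, hi + 1):
            if sequence[j] == pattern[idx] and extend(j, idx + 1):
                return True
        return False

    if not pattern:
        return True
    return any(
        sequence[i] == pattern[0] and extend(i, 1)
        for i in range(len(sequence))
    )


def support(sequences, pattern, min_gap=0, max_gap=2):
    """Number of sequences that contain the pattern under the gap constraint."""
    return sum(occurs_with_gaps(s, pattern, min_gap, max_gap) for s in sequences)


# Toy DNA sequences (made up for illustration).
dna = ["ACGTAC", "AGGTTC", "TTTACG"]
with_gap = support(dna, "AGT", min_gap=0, max_gap=1)
contiguous = support(dna, "AGT", min_gap=0, max_gap=0)
```

Setting `max_gap=0` reduces the check to contiguous substring matching, while a larger `max_gap` lets the pattern skip over non-conserved positions. A real miner would enumerate candidate patterns rather than test a single given one.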
|
25
|
Valizadegan H, Nguyen Q, Hauskrecht M. Learning classification models from multiple experts. J Biomed Inform 2013; 46:1125-35. [PMID: 24035760 PMCID: PMC3922063 DOI: 10.1016/j.jbi.2013.08.007] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 04/13/2013] [Revised: 07/15/2013] [Accepted: 08/17/2013] [Indexed: 10/26/2022]
Abstract
Building classification models from clinical data using machine learning methods often relies on labeling of patient examples by human experts. The standard machine learning framework assumes the labels are assigned by a homogeneous process. However, in reality the labels may come from multiple experts and it may be difficult to obtain a set of class labels everybody agrees on; it is not uncommon that different experts have different subjective opinions on how a specific patient example should be classified. In this work we propose and study a new multi-expert learning framework that assumes the class labels are provided by multiple experts and that these experts may differ in their class label assessments. The framework explicitly models different sources of disagreement and lets us naturally combine labels from different human experts to obtain: (1) a consensus classification model representing the model the group of experts converges to, and (2) individual expert models. We test the proposed framework by building a model for the detection of heparin-induced thrombocytopenia (HIT), where examples are labeled by three experts. We show that our framework is superior to multiple baselines (including the standard machine learning framework in which expert differences are ignored) and that our framework leads to both improved consensus and individual expert models.
Affiliation(s)
- Hamed Valizadegan
- Department of Computer Science, University of Pittsburgh, United States.
|
26
|
Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS One 2013; 8:e66341. [PMID: 23826094 PMCID: PMC3691199 DOI: 10.1371/journal.pone.0066341] [Citation(s) in RCA: 140] [Impact Index Per Article: 12.7] [Received: 12/18/2012] [Accepted: 05/07/2013] [Indexed: 01/14/2023] Open
Abstract
Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don’t think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data – Electronic Medical Records – typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
Affiliation(s)
- Thomas A Lasko
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA.
|
27
|
Batal I, Fradkin D, Harrison J, Moerchen F, Hauskrecht M. Mining Recent Temporal Patterns for Event Detection in Multivariate Time Series Data. KDD: Proceedings. International Conference on Knowledge Discovery & Data Mining 2012; 2012:280-288. [PMID: 25937993 PMCID: PMC4414327 DOI: 10.1145/2339530.2339578] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 10/28/2022]
Abstract
Improving the performance of classifiers using pattern mining techniques has been an active topic of data mining research. In this work we introduce the recent temporal pattern mining framework for finding predictive patterns for monitoring and event detection problems in complex multivariate time series data. This framework first converts time series into time-interval sequences of temporal abstractions. It then constructs more complex temporal patterns backwards in time using temporal operators. We apply our framework to health care data of 13,558 diabetic patients and show its benefits by efficiently finding useful patterns for detecting and diagnosing adverse medical conditions that are associated with diabetes.
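The first step named in this abstract, converting a time series into time-interval sequences of temporal abstractions, can be sketched in Python as below. This is only an illustration of value abstraction under assumed conventions: the state names, the thresholds, the function names, and the readings are hypothetical, and the later steps of the framework (building patterns backwards in time with temporal operators) are not covered.

```python
def abstract_value(value, low, high):
    """Map a numeric value to a qualitative state (thresholds are illustrative)."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Normal"


def to_state_intervals(times, values, low, high):
    """Merge consecutive measurements with the same state into (state, start, end) intervals."""
    intervals = []
    for t, v in zip(times, values):
        state = abstract_value(v, low, high)
        if intervals and intervals[-1][0] == state:
            # Same state as the previous measurement: extend the open interval.
            intervals[-1] = (state, intervals[-1][1], t)
        else:
            # State changed: start a new interval at this time point.
            intervals.append((state, t, t))
    return intervals


# Hypothetical glucose-like series: measurement day and value.
days = [0, 3, 7, 10, 14, 20]
readings = [95, 102, 160, 175, 110, 60]

intervals = to_state_intervals(days, readings, low=70, high=140)
```

Each resulting tuple describes one state interval for one variable. In a multivariate setting the same conversion would be applied per variable before any temporal relations between intervals are examined.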
Affiliation(s)
- Iyad Batal
- Dept. of Computer Science, University of Pittsburgh,
- James Harrison
- Dept. of Public Health Sciences, University of Virginia,
|