1
|
Gao X, Gong Y, Xu T, Lu J, Zhao Y, Dong X. Toward Better Structure and Constraint to Mine Negative Sequential Patterns. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:571-585. [PMID: 33332276 DOI: 10.1109/tnnls.2020.3041732] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Nonoccurring behavior (NOB) studies have attracted the growing attention of scholars as a crucial part of behavioral science. As an effective method to discover both NOB and occurring behaviors (OB), negative sequential pattern (NSP) mining is successfully used in analyzing medical treatment and abnormal behavior patterns. At this time, NSP mining is still an active and challenging research domain. Most of the algorithms are inefficient in practice. Briefly, the key weaknesses of NSP mining are: 1) an inefficient positive sequential pattern (PSP) mining process, 2) a strict constraint of negative containment, and 3) the lack of an effective Negative Sequential Candidate (NSC) generation method. To address these weaknesses, we propose a highly efficient algorithm with improved techniques, named sc-NSP, to mine NSP efficiently. We first propose an improved PrefixSpan algorithm in the PSP mining process, which connects to a bitmap storage structure instead of the original structure. Second, sc-NSP loosens the frequency constraint and exploits the NSC generation method of positive and negative sequential patterns mining (PNSP) (a classic NSP mining method). Furthermore, a novel pruning strategy is designed to reduce the computational complexity of sc-NSP. Finally, sc-NSP obtains the support of NSC by using the most efficient bitwise-based calculation operation. Theoretical analyses show that sc-NSP performs particularly well on data sets with a large number of elements and items in sequence. Comparison and extensive experiments along with case studies on health data show that sc-NSP is 10 times more efficient than other state-of-the-art methods, and the number of NSPs obtained is 5 times greater than other methods.
Collapse
|
2
|
Deeva G, De Smedt J, Saint-Pierre C, Weber R, De Weerdt J. Predicting student performance using sequence classification with time-based windows. EXPERT SYSTEMS WITH APPLICATIONS 2022; 209:118182. [PMID: 35966368 PMCID: PMC9359516 DOI: 10.1016/j.eswa.2022.118182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/19/2021] [Revised: 05/24/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]
Abstract
A growing number of universities worldwide use various forms of online and blended learning as part of their academic curricula. Furthermore, the recent changes caused by the COVID-19 pandemic have led to a drastic increase in importance and ubiquity of online education. Among the major advantages of e-learning is not only improving students' learning experience and widening their educational prospects, but also an opportunity to gain insights into students' learning processes with learning analytics. This study contributes to the topic of improving and understanding e-learning processes in the following ways. First, we demonstrate that accurate predictive models can be built based on sequential patterns derived from students' behavioral data, which are able to identify underperforming students early in the course. Second, we investigate the specificity-generalizability trade-off in building such predictive models by investigating whether predictive models should be built for every course individually based on course-specific sequential patterns, or across several courses based on more general behavioral patterns. Finally, we present a methodology for capturing temporal aspects in behavioral data and analyze its influence on the predictive performance of the models. The results of our improved sequence classification technique are capable to predict student performance with high levels of accuracy, reaching 90% for course-specific models.
Collapse
Affiliation(s)
- Galina Deeva
- Research Centre for Information Systems Engineering, KU Leuven, Belgium
| | - Johannes De Smedt
- Research Centre for Information Systems Engineering, KU Leuven, Belgium
| | | | - Richard Weber
- Department of Industrial Engineering, FCFM, Instituto Sistemas Complejos de Ingeniería, Universidad de Chile, Chile
| | - Jochen De Weerdt
- Research Centre for Information Systems Engineering, KU Leuven, Belgium
| |
Collapse
|
3
|
Feremans L, Cule B, Goethals B. PETSC: pattern-based embedding for time series classification. Data Min Knowl Discov 2022. [DOI: 10.1007/s10618-022-00822-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022]
|
4
|
Chen R, Sun H, Chen L, Zhang J, Wang S. Dynamic order Markov model for categorical sequence clustering. JOURNAL OF BIG DATA 2021; 8:154. [PMID: 34900517 PMCID: PMC8651609 DOI: 10.1186/s40537-021-00547-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/21/2021] [Accepted: 11/23/2021] [Indexed: 06/14/2023]
Abstract
Markov models are extensively used for categorical sequence clustering and classification due to their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models are based on an implicit assumption that the probability of the next state depends on the preceding context/pattern which is consist of consecutive states. This restriction hampers the models since some patterns, disrupted by noise, may be not frequent enough in a consecutive form, but frequent in a sparse form, which can not make use of the information hidden in the sequential data. A sparse pattern corresponds to a pattern in which one or some of the state(s) between the first and last one in the pattern is/are replaced by wildcard(s) that can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach making it capable of dealing with the sparse pattern and handling the length of the sparse patterns adaptively, i.e. allowing variable length pattern with variable wildcards. The model, named Dynamic order Markov model (DOMM), allows deriving a new similarity measure between a sequence and a set of sequences/cluster. DOMM builds a sparse pattern from sub-frequent patterns that contain significant statistical information veiled by the noise. To implement DOMM, we propose a sparse pattern detector (SPD) based on the probability suffix tree (PST) capable of discovering both sparse and consecutive patterns, and then we develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for categorical sequence clustering. Experimental results on real-world datasets demonstrate the promising performance of the proposed model.
Collapse
Affiliation(s)
- Rongbo Chen
- Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada
| | - Haojun Sun
- Department of Computer Science, Shantou University, Shantou, China
| | - Lifei Chen
- Department of Computer Science, Fujian Normal University, Fuzhou, China
| | - Jianfei Zhang
- Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada
| | - Shengrui Wang
- Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada
| |
Collapse
|
5
|
Schvetz M, Fuchs L, Novack V, Moskovitch R. Outcomes prediction in longitudinal data: Study designs evaluation, use case in ICU acquired sepsis. J Biomed Inform 2021; 117:103734. [PMID: 33711544 DOI: 10.1016/j.jbi.2021.103734] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2020] [Revised: 02/27/2021] [Accepted: 03/01/2021] [Indexed: 12/23/2022]
Abstract
Outcomes' prediction in Electronic Health Records (EHR) and specifically in Critical Care is increasingly attracting more exploration and research. In this study, we used clinical data from the Intensive Care Unit (ICU), focusing on ICU acquired sepsis. Looking at the current literature, several evaluation approaches are reported, inspired by epidemiological designs, in which some do not always reflect real-life application's conditions. This problem seems relevant generally to outcomes' prediction in longitudinal EHR data, or generally longitudinal data, while in this study we focused on ICU data. Unlike in most previous studies that investigated all sepsis admissions, we focused specifically on ICU-Acquired Sepsis. Due to the sparse nature of the longitudinal data, we employed the use of Temporal Abstraction and Time Interval-Related Patterns discovery, which are further used as classification features. Two experiments were designed using three different outcomes prediction study designs from the literature, implementing various levels of real-life conditions to evaluate the prediction models. The first experiment focused on predicting whether a patient would suffer from ICU-acquired sepsis and when during her admission, given a sliding observation time window, and the comparison of the three study designs behavior. The second experiment focused only on predicting whether the patient will suffer from ICU-acquired sepsis, based on data taken relatively to his admission start time. Our results show that using Temporal Discretization for Classification (TD4C) led to better performance than using the Equal-Width Discretization, Knowledge-Based, or SAX. Also, using two states abstraction was better than three or four. Using the default Binary TIRP representation method performed better than Mean Duration, Horizontal Support, and horizontally normalized horizontal support. Using XGBoost as a classifier performed better than Logistic Regression, Neural Net, or Random Forest. Additionally, it is demonstrated why the use of case-crossover-control is most appropriate for real life application conditions evaluation, unlike other incomplete designs that may even result in "better performance".
Collapse
Affiliation(s)
- Maya Schvetz
- Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer-Sheva, Israel.
| | - Lior Fuchs
- Medical Intensive Care Unit and Clinical Research Center, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
| | - Victor Novack
- Clinical Research Center, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
| | - Robert Moskovitch
- Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer-Sheva, Israel.
| |
Collapse
|
6
|
|
7
|
Cheng F, Chen J, Qiu J, Zhang L. A subregion division based multi-objective evolutionary algorithm for SVM training set selection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.02.028] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
8
|
Dong X, Gong Y, Cao L. e-RNSP: An Efficient Method for Mining Repetition Negative Sequential Patterns. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:2084-2096. [PMID: 30296247 DOI: 10.1109/tcyb.2018.2869907] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Negative sequential patterns (NSPs), which capture both frequent occurring and nonoccurring behaviors, become increasingly important and sometimes play a role irreplaceable by analyzing occurring behaviors only. Repetition sequential patterns capture repetitions of patterns in different sequences as well as within a sequence and are very important to understand the repetition relations between behaviors. Though some methods are available for mining NSP and repetition positive sequential patterns (RPSPs), we have not found any methods for mining repetition NSP (RNSP). RNSP can help the analysts to further understand the repetition relationships between items and capture more comprehensive information with repetition properties. However, mining RNSP is much more difficult than mining NSP due to the intrinsic challenges of nonoccurring items. To address the above issues, we first propose a formal definition of repetition negative containment. Then, we propose a method to convert repetition negative containment to repetition positive containment, which fast calculates the repetition supports by only using the corresponding RPSP's information without rescanning databases. Finally, we propose an efficient algorithm, called e-RNSP, to mine RNSP efficiently. To the best of our knowledge, e-RNSP is the first algorithm to efficiently mine RNSP. Intensive experimental results on the first four real and synthetic datasets clearly show that e-RNSP can efficiently discover the repetition negative patterns; results on the fifth dataset prove the effectiveness of RNSP which are captured by the proposed method; and the results on the rest 16 datasets analyze the impacts of data characteristics on mining process.
Collapse
|
9
|
|
10
|
|
11
|
Shknevsky A, Shahar Y, Moskovitch R. Consistent discovery of frequent interval-based temporal patterns in chronic patients' data. J Biomed Inform 2017; 75:83-95. [PMID: 28987378 DOI: 10.1016/j.jbi.2017.10.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2017] [Revised: 08/23/2017] [Accepted: 10/02/2017] [Indexed: 11/24/2022]
Abstract
Increasingly, frequent temporal patterns discovered in longitudinal patient records are proposed as features for classification and prediction, and as means to cluster patient clinical trajectories. However, to justify that, we must demonstrate that most frequent temporal patterns are indeed consistently discoverable within the records of different patient subsets within similar patient populations. We have developed several measures for the consistency of the discovery of temporal patterns. We focus on time-interval relations patterns (TIRPs) that can be discovered within different subsets of the same patient population. We expect the discovered TIRPs (1) to be frequent in each subset, (2) preserve their "local" metrics - the absolute frequency of each pattern, measured by a Proportion Test, and (3) preserve their "global" characteristics - their overall distribution, measured by a Kolmogorov-Smirnov test. We also wanted to examine the effect on consistency, over a variety of settings, of varying the minimal frequency threshold for TIRP discovery, and of using a TIRP-filtering criterion that we previously introduced, the Semantic Adjacency Criterion (SAC). We applied our methodology to three medical domains (oncology, infectious hepatitis, and diabetes). We found that, within the minimal frequency ranges we had examined, 70-95% of the discovered TIRPs were consistently discoverable; 40-48% of them maintained their local frequency. TIRP global distribution similarity varied widely, from 0% to 65%. Increasing the threshold usually increased the percentage of TIRPs that were repeatedly discovered across different patient subsets within the same domain, and the probability of a similar TIRP distribution. Using the SAC principle, enhanced, for most minimal support levels, the percentage of repeating TIRPs, their local consistency and their global consistency. The effect of using the SAC was further strengthened as the minimal frequency threshold was raised.
Collapse
Affiliation(s)
- Alexander Shknevsky
- Software and Information Systems Engineering, Ben-Gurion University, Beer Sheva, Israel.
| | - Yuval Shahar
- Software and Information Systems Engineering, Ben-Gurion University, Beer Sheva, Israel.
| | - Robert Moskovitch
- Software and Information Systems Engineering, Ben-Gurion University, Beer Sheva, Israel.
| |
Collapse
|
12
|
|
13
|
Kostakis O, Papapetrou P. On searching and indexing sequences of temporal intervals. Data Min Knowl Discov 2017. [DOI: 10.1007/s10618-016-0489-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
14
|
Xie F, Wu X, Zhu X. Efficient sequential pattern mining with wildcards for keyphrase extraction. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2016.10.011] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
15
|
Egho E, Gay D, Boullé M, Voisine N, Clérot F. A user parameter-free approach for mining robust sequential classification rules. Knowl Inf Syst 2016. [DOI: 10.1007/s10115-016-1002-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|