1
Yoshida T, Hanada H, Nakagawa K, Taji K, Tsuda K, Takeuchi I. Efficient model selection for predictive pattern mining model by safe pattern pruning. Patterns (New York, N.Y.) 2023; 4:100890. [PMID: 38106611 PMCID: PMC10724371 DOI: 10.1016/j.patter.2023.100890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Revised: 11/02/2023] [Accepted: 11/09/2023] [Indexed: 12/19/2023]
Abstract
Predictive pattern mining is an approach used to construct prediction models when the input is represented by structured data, such as sets, graphs, and sequences. The main idea behind predictive pattern mining is to build a prediction model by considering sub-structures, such as subsets, subgraphs, and subsequences (referred to as patterns), present in the structured data as features of the model. The primary challenge in predictive pattern mining lies in the exponential growth of the number of patterns with the complexity of the structured data. In this study, we propose the safe pattern pruning method to address the explosion of pattern numbers in predictive pattern mining. We also discuss how it can be effectively employed throughout the entire model building process in practical data analysis. To demonstrate the effectiveness of the proposed method, we conduct numerical experiments on regression and classification problems involving sets, graphs, and sequences.
Affiliation(s)
- Takumi Yoshida
- Department of Engineering, Nagoya Institute of Technology, Nagoya, Aichi 466-8555, Japan
- Hiroyuki Hanada
- Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan
- Kazuya Nakagawa
- Department of Engineering, Nagoya Institute of Technology, Nagoya, Aichi 466-8555, Japan
- Kouichi Taji
- Department of Mechanical Systems Engineering, Nagoya University, Nagoya, Aichi 464-8603, Japan
- Koji Tsuda
- Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan
- Department of Bioinformatics and Systems Biology, The University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
- Ichiro Takeuchi
- Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan
- Department of Mechanical Systems Engineering, Nagoya University, Nagoya, Aichi 464-8603, Japan
2
Banerjee S, Karunagaran D. An integrated approach for mining precise RNA-based cervical cancer staging biomarkers. Gene 2019; 712:143961. [PMID: 31279709 DOI: 10.1016/j.gene.2019.143961] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/09/2019] [Revised: 07/02/2019] [Accepted: 07/03/2019] [Indexed: 02/06/2023]
Abstract
Since International Federation of Gynecology and Obstetrics (FIGO) staging is mainly based on clinical assessment, this study proposed an integrated approach for mining RNA-based biomarkers to understand the molecular deregulation of signaling pathways and RNAs in cervical cancer. Publicly available data were mined to identify significant RNAs after patient staging. Significant miRNA families were identified from mRNA-miRNA and lncRNA-miRNA interaction network analyses, followed by stage-specific mRNA-miRNA-lncRNA association network generation. Integrated bioinformatic analyses of selected mRNAs and lncRNAs were performed. Results suggest that HBA1, HBA2, HBB, SLC2A1, CXCL10 (stage I), PKIA (stage III) and S100A7 (stage IV) were important. miRNA family enrichment of the interacting miRNA partners of selected RNAs indicated enrichment of the let-7 family. Assembly of collagen fibrils and other multimeric structures_Homosapiens_R-HSA-2022090 in pathway analysis and progesterone_CTD_00006624 in DSigDB analysis were the most significant terms, and SLC2A1, hsa-miR-188-3p, hsa-miR-378a-3p and hsa-miR-150-5p were selected as survival markers.
Affiliation(s)
- Satarupa Banerjee
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, IIT Madras, Chennai 600036, India
- Devarajan Karunagaran
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, IIT Madras, Chennai 600036, India
3
Tietz T, Selinski S, Golka K, Hengstler JG, Gripp S, Ickstadt K, Ruczinski I, Schwender H. Identification of interactions of binary variables associated with survival time using survivalFS. Arch Toxicol 2019; 93:585-602. [PMID: 30694373 DOI: 10.1007/s00204-019-02398-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2018] [Accepted: 01/16/2019] [Indexed: 12/01/2022]
Abstract
Many medical studies aim to identify factors associated with a time to an event such as survival time or time to relapse. Often, in particular when binary variables are considered in such studies, interactions of these variables might be the actual relevant factors for predicting, e.g., the time to recurrence of a disease. Testing all possible interactions is often not possible, so procedures such as logic regression that avoid such an exhaustive search are required. In this article, we present an ensemble method based on logic regression that can cope with the instability of the regression models generated by logic regression. This procedure, called survivalFS, also provides measures for quantifying the importance of the interactions forming the logic regression models on the time to an event and for the assessment of the individual variables that take the multivariate data structure into account. In this context, we introduce a new performance measure, which is an adaptation of Harrell's concordance index. The performance of survivalFS and the proposed importance measures is evaluated in a simulation study as well as in an application to genotype data from a urinary bladder cancer study. Furthermore, we compare the performance of survivalFS and its importance measures for the individual variables with the variable importance measure used in random survival forests, a popular procedure for the analysis of survival data. These applications show that survivalFS is able to identify interactions associated with time to an event and to outperform random survival forests.
Affiliation(s)
- Tobias Tietz
- Mathematical Institute, Heinrich Heine University Düsseldorf, 40225, Düsseldorf, Germany
- Silvia Selinski
- Leibniz Research Centre for Working Environment and Human Factors, TU Dortmund University, IfADo, 44139, Dortmund, Germany
- Klaus Golka
- Leibniz Research Centre for Working Environment and Human Factors, TU Dortmund University, IfADo, 44139, Dortmund, Germany
- Jan G Hengstler
- Leibniz Research Centre for Working Environment and Human Factors, TU Dortmund University, IfADo, 44139, Dortmund, Germany
- Stephan Gripp
- Department of Radiation Oncology, Heinrich Heine University Hospital, 44225, Düsseldorf, Germany
- Katja Ickstadt
- Faculty of Statistics, TU Dortmund University, 44221, Dortmund, Germany
- Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205, USA
- Holger Schwender
- Mathematical Institute, Heinrich Heine University Düsseldorf, 40225, Düsseldorf, Germany
4
Relator RT, Terada A, Sese J. Identifying statistically significant combinatorial markers for survival analysis. BMC Med Genomics 2018; 11:31. [PMID: 29697363 PMCID: PMC5918465 DOI: 10.1186/s12920-018-0346-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/21/2022]
Abstract
Background: Survival analysis methods have been widely applied in different areas of health and medicine, spanning varying events of interest and target diseases. They can be used to relate the survival time of individuals to factors of interest, which makes them useful in searching for biomarkers in diseases such as cancer. However, some disease progression can be very unpredictable because conventional approaches fail to consider multiple-marker interactions. An exponential increase in the number of candidate markers requires a large correction factor in the multiple-testing correction and hides the significance. Methods: We address the issue of testing marker combinations that affect survival by adapting the recently developed Limitless Arity Multiple-testing Procedure (LAMP), a p-value correction technique for statistical tests for combinations of markers. LAMP cannot handle survival data statistics, and hence we extended LAMP for the log-rank test, making it more appropriate for clinical data, with a newly introduced theoretical lower bound of the p-value. Results: We applied the proposed method to gene combination detection for cancer and obtained gene interactions with statistically significant log-rank p-values. Gene combinations with orders of up to 32 genes were detected by our algorithm, and effects of some genes in these combinations are also supported by existing literature. Conclusion: The novel approach for detecting prognostic markers presented here can identify statistically significant markers with no limitations on the order of interaction. Furthermore, it can be applied to different types of genomic data, provided that binarization is possible.
Affiliation(s)
- Raissa T Relator
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Aika Terada
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- PRESTO, Japan Science and Technology Agency, 4-1-8 Honcho, Kawaguchi, Saitama, 332-0012, Japan
- Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba, 277-8561, Japan
- Jun Sese
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), 2-12-1 Okayama, Meguro-ku, Tokyo, 152-8550, Japan
5
Sariyar M, Hoffmann I, Binder H. Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data. BMC Bioinformatics 2014; 15:58. [PMID: 24571520 PMCID: PMC3945780 DOI: 10.1186/1471-2105-15-58] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 10/22/2013] [Accepted: 01/28/2014] [Indexed: 11/23/2022]
Abstract
Background: Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. Results: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is posed by interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real-world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. Conclusion: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
Affiliation(s)
- Murat Sariyar
- Institute of Medical Biostatistics, Epidemiology and Informatics, Medical Center of the Johannes Gutenberg University, Mainz 55131, Germany