1
|
Nawaz MS, Fournier-Viger P, Nawaz S, Zhu H, Yun U. SPM4GAC: SPM based approach for genome analysis and classification of macromolecules. Int J Biol Macromol 2024; 266:130984. [PMID: 38513910 DOI: 10.1016/j.ijbiomac.2024.130984] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2024] [Accepted: 03/16/2024] [Indexed: 03/23/2024]
Abstract
Genome sequence analysis and classification play critical roles in properly understanding an organism's main characteristics, functionalities, and changing (evolving) nature. However, the rapid expansion of genomic data makes genome sequence analysis and classification a challenging task due to the high computational requirements, proper management, and understanding of genomic data. Recently proposed models yielded promising results for the task of genome sequence classification. Nevertheless, these models often ignore the sequential nature of nucleotides, which is crucial for revealing their underlying structure and function. To address this limitation, we present SPM4GAC, a sequential pattern mining (SPM)-based framework to analyze and classify the macromolecule genome sequences of viruses. First, a large dataset containing the genome sequences of various RNA viruses is developed and transformed into a suitable format. On the transformed dataset, algorithms for SPM are used to identify frequent sequential patterns of nucleotide bases. The obtained frequent sequential patterns of bases are then used as features to classify different viruses. Ten classifiers are employed, and their performance is assessed by using several evaluation measures. Finally, a performance comparison of SPM4GAC with state-of-the-art methods for genome sequence classification/detection reveals that SPM4GAC performs better than those methods.
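As a rough illustration of the pipeline described in this abstract (mine frequent patterns of bases, then use them as classifier features), the following minimal sketch treats contiguous substrings as a simplified stand-in for sequential patterns. The function names, the toy sequences, and the support threshold are assumptions made for illustration; the abstract does not name the SPM algorithms or the classifiers used in SPM4GAC.

```python
from collections import Counter


def frequent_patterns(sequences, max_len=3, min_support=0.5):
    """Return contiguous base patterns found in at least min_support of the sequences.

    Simplification (assumption): only contiguous substrings up to max_len are
    considered, whereas general SPM algorithms may also allow gaps.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for k in range(1, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        counts.update(seen)  # each pattern counts once per sequence
    threshold = min_support * len(sequences)
    return sorted(p for p, c in counts.items() if c >= threshold)


def to_features(seq, patterns):
    """Encode a sequence as a binary vector: 1 if the pattern occurs, else 0."""
    return [1 if p in seq else 0 for p in patterns]


# Toy genome fragments (hypothetical data).
genomes = ["ACGTAC", "ACGGTA", "TTACGA"]
patterns = frequent_patterns(genomes, max_len=3, min_support=1.0)
features = [to_features(g, patterns) for g in genomes]
print(patterns)
print(features)
```

The resulting binary vectors could then be passed to any standard classifier; which classifier works best is an empirical question that this sketch does not address.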
Affiliation(s)
- M Saqib Nawaz
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China.
- Shoaib Nawaz
  - Department of Pharmacy, The University of Lahore, Sargodha Campus, Pakistan.
- Haowei Zhu
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China.
- Unil Yun
  - Sejong University, Seoul, Republic of Korea.
|
2
|
Ott J, Park T. Overview of frequent pattern mining. Genomics Inform 2022; 20:e39. [PMID: 36617647 PMCID: PMC9847378 DOI: 10.5808/gi.22074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 12/22/2022] [Indexed: 12/31/2022] Open
Abstract
Various methods of frequent pattern mining have been applied to genetic problems, specifically, to the combined association of two genotypes (a genotype pattern, or diplotype) at different DNA variants with disease. These methods have the ability to come up with a selection of genotype patterns that are more common in affected than unaffected individuals, and the assessment of statistical significance for these selected patterns poses some unique problems, which are briefly outlined here.
Affiliation(s)
- Jurg Ott
  - Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10065, USA (corresponding author)
- Taesung Park
  - Department of Statistics, Seoul National University, Seoul 08826, Korea
|
3
|
Fuzzy-driven Periodic Frequent Pattern Mining. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
4
|
Mining Statistically Significant Patterns with High Utility. INT J COMPUT INT SYS 2022. [DOI: 10.1007/s44196-022-00149-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022] Open
Abstract
Statistically significant pattern mining (SSPM) is to mine patterns with significance based on hypothesis test. Under the constraint of statistical significance, our study aims to introduce a new preference relation into high utility patterns and to discover high utility and significant patterns (HUSPs) from transaction datasets, which has never been considered in existing SSPM problems. Our approach can be divided into two parts, HUSP-Mining and HUSP-Test. HUSP-Mining looks for HUSP candidates and HUSP-Test tests their significance. HUSP-Mining is not outputting all high utility itemsets (HUIs) as HUSP candidates; it is established based on candidate length and testable support requirements which can remove many insignificant HUIs early in the mining process; compared with the traditional HUIs mining algorithm, it can get candidates in a short time without losing the real HUSPs. HUSP-Test is to draw significant patterns from the results of HUSP-Mining based on Fisher’s test. We propose an iterative multiple testing procedure, which can alternately and efficiently reject a hypothesis and safely ignore the hypotheses that have less utility than the rejected hypothesis. HUSP-Test controls Family-wise Error Rate (FWER) under a user-defined threshold by correcting the test level which can find more HUSPs than standard Bonferroni’s control. Substantial experiments on real datasets show that our algorithm can draw HUSPs efficiently from transaction datasets with strong mathematical guarantee.
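To make the baseline mentioned in this abstract concrete, the sketch below computes a one-sided Fisher's exact test p-value for a pattern's 2x2 contingency table and applies the standard Bonferroni correction. It is not the iterative HUSP-Test procedure, whose details are not given in the abstract; the function names, table layout, and example counts are assumptions for illustration.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: a = class-1 transactions containing the pattern,
    b = class-0 transactions containing it, c and d = the transactions
    in each class that do not contain it. Returns P(X >= a) under the
    hypergeometric null hypothesis of no association.
    """
    n1 = a + c          # size of class 1
    n0 = b + d          # size of class 0
    k = a + b           # transactions containing the pattern
    denom = comb(n1 + n0, k)
    upper = min(k, n1)
    tail = sum(comb(n1, x) * comb(n0, k - x) for x in range(a, upper + 1))
    return tail / denom


def bonferroni_significant(p_values, alpha=0.05):
    """Return the names whose p-value passes alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p <= threshold]


# Hypothetical candidate patterns with their contingency counts.
p_a = fisher_one_sided(8, 1, 2, 9)   # strongly associated with class 1
p_b = fisher_one_sided(5, 5, 5, 5)   # no association
print(p_a, p_b)
print(bonferroni_significant({"A": p_a, "B": p_b}, alpha=0.05))
```

Bonferroni divides the test level by the total number of hypotheses, which is why procedures that can safely skip some hypotheses, as the abstract claims for HUSP-Test, may retain more patterns.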
|
5
|
Le B, Truong T, Duong H, Fournier-Viger P, Fujita H. H-FHAUI: Hiding Frequent High Average Utility Itemsets. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
|
6
|
Yang Q, Luo T, Zhang W, Zhong X, He P, Zheng H. Data-driven treatment pathways mining for early breast cancer using cSPADE algorithm and system clustering. Int J Health Plann Manage 2022; 37:2569-2584. [PMID: 35445441 DOI: 10.1002/hpm.3483] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/28/2020] [Revised: 02/09/2022] [Accepted: 03/30/2022] [Indexed: 02/05/2023] Open
Abstract
OBJECTIVES: Due to the multidimensional, multilayered, and chronological order of the cancer data, it was challenging for us to extract treatment paths. To determine whether the cSPADE algorithm and system clustering proposed in this study can effectively identify the treatment pathways for early breast cancer.

METHODS: We applied data mining technology to the electronic medical records of 6891 early breast cancer patients to mine treatment pathways. We provided a method of extracting data from EMR and performed three-stage mining: determining the treatment stage through the cSPADE algorithm → system clustering for treatment plan extraction → cSPADE mining sequence pattern for treatment. The Kolmogorov-Smirnov test and correlation analysis were used to cross-validate the sequence rules of early breast cancer treatment pathways.

RESULTS: We unearthed 55 sequence rules for early breast cancer treatment, 3 preoperative neoadjuvant chemotherapy regimens, three postoperative chemotherapy regimens, and 2 chemotherapy regimens for patients without surgery. Through 5-fold cross-validation, Pearson and Spearman correlation tests were performed. At the significance level of p < 0.05, all correlation coefficients of support, confidence and lift were greater than 0.89. Using the Kolmogorov-Smirnov test, we found no significant differences between the sequence distributions.

CONCLUSIONS: We have proved that cSPADE algorithm combined system clustering is an effective technique for identifying temporal relationships between treatment modalities, enabling hierarchical and vertical mining of breast cancer treatment models. In addition, we confirmed the robustness of the results by cross-validation of these treatment pathway ordering rules. Through this method, the treatment path of early breast cancer patients can be revealed, and the real-world breast cancer treatment behaviour model can be evaluated, which can provide reference for the redesign and optimization of treatment path.
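The support, confidence, and lift measures that this abstract reports for its sequence rules can be sketched as follows, using the commonly used definitions for a rule "antecedent, later followed by consequent". This is not the cSPADE algorithm itself; the helper names and the toy treatment pathways are hypothetical.

```python
def contains(seq, pattern):
    """True if the items of pattern occur in seq in the same order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)


def rule_metrics(sequences, antecedent, consequent):
    """Return (support, confidence, lift) of the sequence rule antecedent => consequent."""
    n = len(sequences)
    sup_rule = sum(contains(s, antecedent + consequent) for s in sequences) / n
    sup_ante = sum(contains(s, antecedent) for s in sequences) / n
    sup_cons = sum(contains(s, consequent) for s in sequences) / n
    if sup_ante == 0 or sup_cons == 0:
        return sup_rule, 0.0, 0.0  # rule is undefined on this data
    confidence = sup_rule / sup_ante
    lift = confidence / sup_cons
    return sup_rule, confidence, lift


# Hypothetical treatment pathways, one ordered event list per patient.
pathways = [
    ["surgery", "chemo", "radio"],
    ["chemo", "surgery", "radio"],
    ["surgery", "radio"],
    ["surgery", "chemo"],
]
support, confidence, lift = rule_metrics(pathways, ["surgery"], ["radio"])
print(support, confidence, lift)
```

A lift close to 1 indicates that the consequent is about as likely after the antecedent as it is overall, so such a rule carries little information despite high support.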
Affiliation(s)
- Qing Yang
  - Institute of Hospital Management, West China Hospital, Sichuan University, Chengdu, China
- Ting Luo
  - Department of Head, Neck and Mammary Gland Oncology, Cancer Center, West China Hospital, Sichuan University, Chengdu, China
- Wei Zhang
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Xiaorong Zhong
  - Department of Head, Neck and Mammary Gland Oncology, Cancer Center, West China Hospital, Sichuan University, Chengdu, China
  - Laboratory of Molecular Diagnosis of Cancer, Clinical Research Center for Breast, West China Hospital, Sichuan University, Chengdu, China
- Ping He
  - Department of Head, Neck and Mammary Gland Oncology, Cancer Center, West China Hospital, Sichuan University, Chengdu, China
- Hong Zheng
  - Department of Head, Neck and Mammary Gland Oncology, Cancer Center, West China Hospital, Sichuan University, Chengdu, China
  - Laboratory of Molecular Diagnosis of Cancer, Clinical Research Center for Breast, West China Hospital, Sichuan University, Chengdu, China
|
7
|
Huynh U, Le B, Dinh DT, Fujita H. Multi-core parallel algorithms for hiding high-utility sequential patterns. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
8
|
|
9
|
Nawaz MS, Fournier-Viger P, Shojaee A, Fujita H. Using artificial intelligence techniques for COVID-19 genome analysis. APPL INTELL 2021; 51:3086-3103. [PMID: 34764587 PMCID: PMC7888282 DOI: 10.1007/s10489-021-02193-w] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Accepted: 01/04/2021] [Indexed: 12/25/2022]
Abstract
The genome of the novel coronavirus (COVID-19) disease was first sequenced in January 2020, approximately a month after its emergence in Wuhan, capital of Hubei province, China. COVID-19 genome sequencing is critical to understanding the virus behavior, its origin, how fast it mutates, and for the development of drugs/vaccines and effective preventive strategies. This paper investigates the use of artificial intelligence techniques to learn interesting information from COVID-19 genome sequences. Sequential pattern mining (SPM) is first applied on a computer-understandable corpus of COVID-19 genome sequences to see if interesting hidden patterns can be found, which reveal frequent patterns of nucleotide bases and their relationships with each other. Second, sequence prediction models are applied to the corpus to evaluate if nucleotide base(s) can be predicted from previous ones. Third, for mutation analysis in genome sequences, an algorithm is designed to find the locations in the genome sequences where the nucleotide bases are changed and to calculate the mutation rate. Obtained results suggest that SPM and mutation analysis techniques can reveal interesting information and patterns in COVID-19 genome sequences to examine the evolution and variations in COVID-19 strains respectively.
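The mutation analysis step described in this abstract (locating positions where bases change and computing a mutation rate) might look roughly like the sketch below. It assumes the two sequences are already aligned to the same length and counts substitutions only; the function names and example sequences are hypothetical, and the paper's actual algorithm may differ.

```python
def find_mutations(reference, sample):
    """Return (position, reference_base, sample_base) for every differing position.

    Assumption: both sequences are pre-aligned and have equal length,
    so insertions and deletions are not handled here.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


def mutation_rate(reference, sample):
    """Fraction of positions at which the sample differs from the reference."""
    return len(find_mutations(reference, sample)) / len(reference)


# Toy aligned fragments (hypothetical data, 0-based positions).
reference = "ACGTACGT"
sample = "ACGAACGG"
mutations = find_mutations(reference, sample)
rate = mutation_rate(reference, sample)
print(mutations)
print(rate)
```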
Affiliation(s)
- M. Saqib Nawaz
  - School of Humanities and Social Sciences, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Philippe Fournier-Viger
  - School of Humanities and Social Sciences, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Hamido Fujita
  - Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan
|
10
|
Gan W, Lin JCW, Zhang J, Fournier-Viger P, Chao HC, Yu PS. Fast Utility Mining on Sequence Data. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:487-500. [PMID: 32142464 DOI: 10.1109/tcyb.2020.2970176] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 06/10/2023]
Abstract
High-utility sequential pattern (HUSP) mining is an emerging topic in the field of knowledge discovery in databases. It consists of discovering subsequences that have a high utility (importance) in sequences, which can be referred to as HUSPs. HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendations, click-stream analysis, and route planning. Several algorithms have been proposed to efficiently mine utility-based useful sequential patterns. However, due to the combinatorial explosion of the search space for low utility threshold and large-scale data, the performances of these algorithms are unsatisfactory in terms of runtime and memory usage. Hence, this article proposes an efficient algorithm for the task of HUSP mining, called HUSP mining with UL-list (HUSP-ULL). It utilizes a lexicographic q -sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of the candidate sequences and reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets showed that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and that it outperforms the state-of-the-art algorithms.
|
11
|
Gan W, Lin JCW, Fournier-Viger P, Chao HC, Yu PS. HUOPM: High-Utility Occupancy Pattern Mining. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:1195-1208. [PMID: 30794524 DOI: 10.1109/tcyb.2019.2896267] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 06/09/2023]
Abstract
Mining useful patterns from varied types of databases is an important research topic, which has many real-life applications. Most studies have considered the frequency as sole interestingness measure to identify high-quality patterns. However, each object is different in nature. The relative importance of objects is not equal, in terms of criteria, such as the utility, risk, or interest. Besides, another limitation of frequent patterns is that they generally have a low occupancy, that is, they often represent small sets of items in transactions containing many items and, thus, may not be truly representative of these transactions. To extract high-quality patterns in real-life applications, this paper extends the occupancy measure to also assess the utility of patterns in transaction databases. We propose an efficient algorithm named high-utility occupancy pattern mining (HUOPM). It considers user preferences in terms of frequency, utility, and occupancy. A novel frequency-utility tree and two compact data structures, called the utility-occupancy list and frequency-utility table, are designed to provide global and partial downward closure properties for pruning the search space. The proposed method can efficiently discover the complete set of high-quality patterns without candidate generation. Extensive experiments have been conducted on several datasets to evaluate the effectiveness and efficiency of the proposed algorithm. Results show that the derived patterns are intelligible, reasonable, and acceptable, and that HUOPM with its pruning strategies outperforms the state-of-the-art algorithm, in terms of runtime and search space, respectively.
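One way to read the utility-occupancy idea in this abstract is as the share of a transaction's total utility that a pattern accounts for, averaged over the transactions containing the pattern. The sketch below implements that reading by brute force; the exact definition used by HUOPM is not stated in the abstract, so the formula, the function name, and the toy data are all assumptions, and none of the paper's tree or list structures are reproduced.

```python
def utility_occupancy(transactions, pattern):
    """Return (support, average utility share) of pattern.

    Each transaction is a dict mapping item -> utility of that item in the
    transaction. Assumed definition: for every transaction containing all
    items of the pattern, take (utility of the pattern's items) divided by
    (total utility of the transaction), then average these ratios.
    """
    pattern = set(pattern)
    ratios = []
    for t in transactions:
        if pattern.issubset(t):
            ratios.append(sum(t[i] for i in pattern) / sum(t.values()))
    support = len(ratios)
    occupancy = sum(ratios) / support if support else 0.0
    return support, occupancy


# Hypothetical transactions with per-item utilities.
transactions = [
    {"a": 5, "b": 2, "c": 3},
    {"a": 4, "b": 6},
    {"c": 8, "d": 2},
]
support, occupancy = utility_occupancy(transactions, ["a", "b"])
print(support, occupancy)
```

Under this reading, a pattern can be frequent yet have low occupancy when it covers only a small part of the utility in the transactions that contain it, which matches the limitation of frequent patterns that the abstract points out.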
|
12
|
|