1
|
HDSHUI-miner: a novel algorithm for discovering spatial high-utility itemsets in high-dimensional spatiotemporal databases. APPL INTELL 2023. [DOI: 10.1007/s10489-022-04436-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/13/2023]
|
2
|
Hui2Vec: Learning Transaction Embedding Through High Utility Itemsets. BIG DATA ANALYTICS 2022. [DOI: 10.1007/978-3-031-24094-2_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
|
3
|
|
4
|
|
5
|
AbdelAziz AM, Soliman T, Ghany KKA, Sewisy A. A hybrid multi-objective whale optimization algorithm for analyzing microarray data based on Apache Spark. PeerJ Comput Sci 2021; 7:e416. [PMID: 33834101 PMCID: PMC8022636 DOI: 10.7717/peerj-cs.416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2020] [Accepted: 02/05/2021] [Indexed: 06/12/2023]
Abstract
A microarray is a revolutionary tool that generates vast volumes of data that describe the expression profiles of genes under investigation that can be qualified as Big Data. Hadoop and Spark are efficient frameworks, developed to store and analyze Big Data. Analyzing microarray data helps researchers to identify correlated genes. Clustering has been successfully applied to analyze microarray data by grouping genes with similar expression profiles into clusters. The complex nature of microarray data obligated clustering methods to employ multiple evaluation functions to ensure obtaining solutions with high quality. This transformed the clustering problem into a Multi-Objective Problem (MOP). A new and efficient hybrid Multi-Objective Whale Optimization Algorithm with Tabu Search (MOWOATS) was proposed to solve MOPs. In this article, MOWOATS is proposed to analyze massive microarray datasets. Three evaluation functions have been developed to ensure an effective assessment of solutions. MOWOATS has been adapted to run in parallel using Spark over Hadoop computing clusters. The quality of the generated solutions was evaluated based on different indices, such as Silhouette and Davies-Bouldin indices. The obtained clusters were very similar to the original classes. Regarding the scalability, the running time was inversely proportional to the number of computing nodes.
Collapse
Affiliation(s)
| | - Taysir Soliman
- Faculty of Computers and Information, Assiut University, Egypt
| | - Kareem Kamal A. Ghany
- Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt
- College of Computing and Informatics, Saudi Electronic University, Riyadh, KSA
| | - Adel Sewisy
- Faculty of Computers and Information, Assiut University, Egypt
| |
Collapse
|
6
|
Anguita-Ruiz A, Segura-Delgado A, Alcalá R, Aguilera CM, Alcalá-Fdez J. eXplainable Artificial Intelligence (XAI) for the identification of biologically relevant gene expression patterns in longitudinal human studies, insights from obesity research. PLoS Comput Biol 2020; 16:e1007792. [PMID: 32275707 PMCID: PMC7176286 DOI: 10.1371/journal.pcbi.1007792] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2019] [Revised: 04/22/2020] [Accepted: 03/17/2020] [Indexed: 12/18/2022] Open
Abstract
Until date, several machine learning approaches have been proposed for the dynamic modeling of temporal omics data. Although they have yielded impressive results in terms of model accuracy and predictive ability, most of these applications are based on "Black-box" algorithms and more interpretable models have been claimed by the research community. The recent eXplainable Artificial Intelligence (XAI) revolution offers a solution for this issue, were rule-based approaches are highly suitable for explanatory purposes. The further integration of the data mining process along with functional-annotation and pathway analyses is an additional way towards more explanatory and biologically soundness models. In this paper, we present a novel rule-based XAI strategy (including pre-processing, knowledge-extraction and functional validation) for finding biologically relevant sequential patterns from longitudinal human gene expression data (GED). To illustrate the performance of our pipeline, we work on in vivo temporal GED collected within the course of a long-term dietary intervention in 57 subjects with obesity (GSE77962). As validation populations, we employ three independent datasets following the same experimental design. As a result, we validate primarily extracted gene patterns and prove the goodness of our strategy for the mining of biologically relevant gene-gene temporal relations. Our whole pipeline has been gathered under open-source software and could be easily extended to other human temporal GED applications.
Collapse
Affiliation(s)
- Augusto Anguita-Ruiz
- Department of Biochemistry and Molecular Biology II, Institute of Nutrition and Food Technology "José Mataix", Center of Biomedical Research, University of Granada, Granada, Spain
- Instituto de Investigación Biosanitaria ibs.GRANADA, Granada, Spain
- CIBEROBN (Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III (ISCIII), Madrid, Spain
| | - Alberto Segura-Delgado
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
| | - Rafael Alcalá
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
| | - Concepción M. Aguilera
- Department of Biochemistry and Molecular Biology II, Institute of Nutrition and Food Technology "José Mataix", Center of Biomedical Research, University of Granada, Granada, Spain
- Instituto de Investigación Biosanitaria ibs.GRANADA, Granada, Spain
- CIBEROBN (Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III (ISCIII), Madrid, Spain
| | - Jesús Alcalá-Fdez
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
| |
Collapse
|
7
|
Parraga-Alava J, Dorn M, Inostroza-Ponta M. A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Min 2018; 11:16. [PMID: 30100924 PMCID: PMC6081857 DOI: 10.1186/s13040-018-0178-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2017] [Accepted: 07/29/2018] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Biologists aim to understand the genetic background of diseases, metabolic disorders or any other genetic condition. Microarrays are one of the main high-throughput technologies for collecting information about the behaviour of genetic information on different conditions. In order to analyse this data, clustering arises as one of the main techniques used, and it aims at finding groups of genes that have some criterion in common, like similar expression profile. However, the problem of finding groups is normally multi dimensional, making necessary to approach the clustering as a multi-objective problem where various cluster validity indexes are simultaneously optimised. They are usually based on criteria like compactness and separation, which may not be sufficient since they can not guarantee the generation of clusters that have both similar expression patterns and biological coherence. METHOD We propose a Multi-Objective Clustering algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) to find clusters of genes with high levels of co-expression, biological coherence, and also good compactness and separation. Cluster quality indexes are used to optimise simultaneously gene relationships at expression level and biological functionality. Our proposal also includes intensification and diversification strategies to improve the search process. RESULTS The effectiveness of the proposed algorithm is demonstrated on four publicly available datasets. Comparative studies of the use of different objective functions and other widely used microarray clustering techniques are reported. Statistical, visual and biological significance tests are carried out to show the superiority of the proposed algorithm. CONCLUSIONS Integrating a-priori biological knowledge into a multi-objective approach and using intensification and diversification strategies allow the proposed algorithm to find solutions with higher quality than other microarray clustering techniques available in the literature in terms of co-expression, biological coherence, compactness and separation.
Collapse
Affiliation(s)
- Jorge Parraga-Alava
- Centre for Biotechnology and Bioengineering (CeBiB), Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Santiago, Chile
- Carrera de Computación, Escuela Superior Politécnica Agropecuaria de Manabí Manuel Félix López, Campus Politécnico Sitio El Limón, Calceta, Ecuador
| | - Marcio Dorn
- Instituto de Informatica, Universidade Federal do Rio Grande do Sul, Av. Bento Gonçalves 9500, Porto Alegre, 91501-970 Brasil
| | - Mario Inostroza-Ponta
- Centre for Biotechnology and Bioengineering (CeBiB), Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Santiago, Chile
| |
Collapse
|
8
|
|
9
|
|
10
|
Zida S, Fournier-Viger P, Lin JCW, Wu CW, Tseng VS. EFIM: a fast and memory efficient algorithm for high-utility itemset mining. Knowl Inf Syst 2016. [DOI: 10.1007/s10115-016-0986-0] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
11
|
Naulaerts S, Moens S, Engelen K, Berghe WV, Goethals B, Laukens K, Meysman P. Practical Approaches for Mining Frequent Patterns in Molecular Datasets. Bioinform Biol Insights 2016; 10:37-47. [PMID: 27168722 PMCID: PMC4856181 DOI: 10.4137/bbi.s38419] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2016] [Revised: 03/13/2016] [Accepted: 03/16/2016] [Indexed: 12/31/2022] Open
Abstract
Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still slow down promising algorithms from gaining popularity in the life science community. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations on common bioinformatics problems and illustrate some of the advantages and disadvantages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should decide their software choice based on their research question, programming proficiency, and added value of extra features.
Collapse
Affiliation(s)
- Stefan Naulaerts
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium.; Biomedical Informatics Research Center Antwerpen (Biomina), University of Antwerp/Antwerp University Hospital, Antwerp, Belgium
| | - Sandy Moens
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
| | - Kristof Engelen
- Department of Computational Biology, Fondazione Edmund Mach, San Michele all'Adige, Trento, Italy
| | - Wim Vanden Berghe
- Department of Biomedical Sciences, Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), University of Antwerp, Antwerp, Belgium
| | - Bart Goethals
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
| | - Kris Laukens
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium.; Biomedical Informatics Research Center Antwerpen (Biomina), University of Antwerp/Antwerp University Hospital, Antwerp, Belgium
| | - Pieter Meysman
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium.; Biomedical Informatics Research Center Antwerpen (Biomina), University of Antwerp/Antwerp University Hospital, Antwerp, Belgium
| |
Collapse
|
12
|
Cheng CP, DeBoever C, Frazer KA, Liu YC, Tseng VS. MiningABs: mining associated biomarkers across multi-connected gene expression datasets. BMC Bioinformatics 2014; 15:173. [PMID: 24909518 PMCID: PMC4068973 DOI: 10.1186/1471-2105-15-173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2013] [Accepted: 06/03/2014] [Indexed: 11/12/2022] Open
Abstract
Background Human disease often arises as a consequence of alterations in a set of associated genes rather than alterations to a set of unassociated individual genes. Most previous microarray-based meta-analyses identified disease-associated genes or biomarkers independent of genetic interactions. Therefore, in this study, we present the first meta-analysis method capable of taking gene combination effects into account to efficiently identify associated biomarkers (ABs) across different microarray platforms. Results We propose a new meta-analysis approach called MiningABs to mine ABs across different array-based datasets. The similarity between paired probe sequences is quantified as a bridge to connect these datasets together. The ABs can be subsequently identified from an “improved” common logit model (c-LM) by combining several sibling-like LMs in a heuristic genetic algorithm selection process. Our approach is evaluated with two sets of gene expression datasets: i) 4 esophageal squamous cell carcinoma and ii) 3 hepatocellular carcinoma datasets. Based on an unbiased reciprocal test, we demonstrate that each gene in a group of ABs is required to maintain high cancer sample classification accuracy, and we observe that ABs are not limited to genes common to all platforms. Investigating the ABs using Gene Ontology (GO) enrichment, literature survey, and network analyses indicated that our ABs are not only strongly related to cancer development but also highly connected in a diverse network of biological interactions. Conclusions The proposed meta-analysis method called MiningABs is able to efficiently identify ABs from different independently performed array-based datasets, and we show its validity in cancer biology via GO enrichment, literature survey and network analyses. We postulate that the ABs may facilitate novel target and drug discovery, leading to improved clinical treatment. Java source code, tutorial, example and related materials are available at “http://sourceforge.net/projects/miningabs/”.
Collapse
Affiliation(s)
| | | | - Kelly A Frazer
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan.
| | | | | |
Collapse
|