1
|
Abstract
BACKGROUND High-throughput methods in the biological and biomedical fields acquire a large number of molecular parameters, or omics data, in a single experiment. Combining these omics data can significantly increase the capability for recovering fine-tuned structures and for reducing the effects of experimental and biological noise in the data.
RESULTS In this work we propose a multi-view integration methodology (named FH-Clust) for identifying patient subgroups from different omics information (e.g., gene expression, miRNA expression, methylation). In particular, hierarchical structures of patient data are obtained in each omic (or view), and their topologies are finally merged through a consensus matrix. A central aspect of the methodology is the use of a dissimilarity measure between sets of observations, based on an appropriate metric. For each view, a dendrogram is obtained by hierarchical clustering based on a fuzzy equivalence relation with Łukasiewicz-valued fuzzy similarity. Finally, a consensus matrix, which summarizes the information in all dendrograms, is formed by combining the multiple hierarchical agglomerations through an approach based on transitive consensus matrix construction. Several experiments and comparisons are carried out on real data (e.g., glioblastoma, prostate cancer) to assess the proposed approach.
CONCLUSIONS Fuzzy logic allows us to introduce more flexible data agglomeration techniques. From an analysis of the scientific literature, this appears to be the first time that a model based on fuzzy logic has been used for the agglomeration of multi-omic data. The results suggest that FH-Clust provides better prognostic value and clinical significance than the analysis of single-omic data alone, and that it is very competitive with other techniques from the literature.
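As a rough illustration of the fuzzy machinery this abstract refers to, the sketch below computes the max-Łukasiewicz transitive closure of a symmetric, reflexive similarity matrix and turns it into a dissimilarity. It is not the FH-Clust implementation: the function names and the toy matrix are assumptions made for illustration, and the abstract does not say how the similarity is derived from omics data or how the consensus matrix is built. The sketch relies only on a standard fact: a relation R is Łukasiewicz-transitive exactly when 1 − R satisfies the triangle inequality, so 1 − R can serve as a dissimilarity for agglomerative clustering.

```python
import numpy as np


def lukasiewicz_closure(R):
    """Max-Lukasiewicz transitive closure of a fuzzy relation (illustrative helper).

    Repeats R <- max(R, R o R) until nothing changes, where
    (R o R)[i, k] = max_j max(0, R[i, j] + R[j, k] - 1).
    """
    R = np.asarray(R, dtype=float)
    while True:
        # pairwise[i, j, k] = T(R[i, j], R[j, k]) with the Lukasiewicz t-norm T
        pairwise = np.maximum(0.0, R[:, :, None] + R[None, :, :] - 1.0)
        updated = np.maximum(R, pairwise.max(axis=1))
        if np.allclose(updated, R):
            return updated
        R = updated


def satisfies_triangle_inequality(D, tol=1e-12):
    """Check D[i, k] <= D[i, j] + D[j, k] for all i, j, k."""
    return bool(np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol))


# Toy similarity between three patients (hypothetical values).
R = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.8],
              [0.2, 0.8, 1.0]])

closure = lukasiewicz_closure(R)   # R[0, 2] rises from 0.2 to about 0.7
dissimilarity = 1.0 - closure      # usable as input to agglomerative clustering
```

In this toy case, patients 0 and 2 are both strongly similar to patient 1, so transitivity forces their mutual similarity up to at least 0.9 + 0.8 − 1.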
Affiliation(s)
- Angelo Ciaramella
  - Dipartimento di Scienze e Tecnologie, Università degli Studi di Napoli “Parthenope”, Centro Direzionale, C4 Island, Naples, 80143 Italy
- Antonino Staiano
  - Dipartimento di Scienze e Tecnologie, Università degli Studi di Napoli “Parthenope”, Centro Direzionale, C4 Island, Naples, 80143 Italy
|
2
|
Belciug S. Pathologist at work. Artif Intell Cancer 2020. [DOI: 10.1016/b978-0-12-820201-2.00003-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2022] Open
|
3
|
Lambrou GI, Sdraka M, Koutsouris D. The “Gene Cube”: A Novel Approach to Three-dimensional Clustering of Gene Expression Data. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190116170406] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
Background:
A very popular technique for isolating significant genes from cancerous tissues is the application of various clustering algorithms to data obtained from DNA microarray experiments.
Aim:
The objective of the present work is to take into account the chromosomal identity of every gene before clustering, by creating a three-dimensional structure of the form Chromosomes×Genes×Samples. The k-means algorithm and a triclustering technique called δ-TRIMAX are then applied independently to the structure.
Materials and Methods:
The present algorithm was developed using the Python programming language (v. 3.5.1). For this work, we used two distinct public datasets containing healthy control samples and tissue samples from bladder cancer patients. Background correction was performed by subtracting the median global background from the median local background from the signal intensity. The quantile normalization method was applied for sample normalization. Three known algorithms were applied to test the “gene cube”: classical k-means, a transformed 3D k-means, and δ-TRIMAX.
Results:
Our proposed data structure consists of a 3D matrix of the form Chromosomes×Genes×Samples. Clustering analysis of that structure produced very good results, as we were able to identify gene expression patterns among samples, genes, and chromosomes.
Discussion:
To the best of our knowledge, this is the first time that such a structure has been reported, and it constitutes a useful tool for gene classification from high-throughput gene expression experiments.
Conclusion:
Such approaches could prove useful for understanding disease mechanics, and tumors in particular.
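The abstract above mentions quantile normalization for sample normalization without giving details. The following is a minimal sketch of the textbook procedure, not the authors' code: the function name, the genes-by-samples layout, and the toy matrix are assumptions, and ties are broken by position rather than averaged.

```python
import numpy as np


def quantile_normalize(X):
    """Quantile-normalize a genes x samples matrix (illustrative helper).

    After normalization every sample (column) has the same empirical
    distribution: the mean of the k-th smallest values across samples.
    """
    X = np.asarray(X, dtype=float)
    # order[:, s] lists the gene indices of sample s from smallest to largest
    order = np.argsort(X, axis=0, kind="stable")
    # ranks[g, s] is the position of gene g within sample s (0 = smallest)
    ranks = np.argsort(order, axis=0, kind="stable")
    # reference[k] is the mean of the k-th smallest value over all samples
    reference = np.sort(X, axis=0).mean(axis=1)
    return reference[ranks]


# Toy intensities: 4 genes (rows) measured in 3 samples (columns).
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])

normalized = quantile_normalize(X)
```

Within each sample the ordering of genes is preserved; only the values are replaced by the shared reference distribution.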
Affiliation(s)
- George I. Lambrou
  - National Technical University of Athens, School of Electrical and Computer Engineering, Biomedical Engineering Laboratory, Heroon Polytecniou 9, Athens, 15780, Athens, Greece
- Maria Sdraka
  - National Technical University of Athens, School of Electrical and Computer Engineering, Biomedical Engineering Laboratory, Heroon Polytecniou 9, Athens, 15780, Athens, Greece
- Dimitrios Koutsouris
  - National Technical University of Athens, School of Electrical and Computer Engineering, Biomedical Engineering Laboratory, Heroon Polytecniou 9, Athens, 15780, Athens, Greece
|
4
|
Nardone D, Ciaramella A, Staiano A. A Sparse-Modeling Based Approach for Class Specific Feature Selection. PeerJ Comput Sci 2019; 5:e237. [PMID: 33816890 PMCID: PMC7924712 DOI: 10.7717/peerj-cs.237] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/16/2019] [Accepted: 10/20/2019] [Indexed: 05/25/2023]
Abstract
In this work, we propose a novel feature selection framework called Sparse-Modeling Based Approach for Class Specific Feature Selection (SMBA-CSFS), which simultaneously exploits the ideas of sparse modeling and class-specific feature selection. Feature selection plays a key role in several fields (e.g., computational biology): it makes it possible to work with models that have fewer variables and are therefore easier to explain, provides valuable insight into the importance of each variable's role, and is likely to speed up experimental validation. Unfortunately, as the no-free-lunch theorems also suggest, no single approach in the literature is best at detecting the optimal feature subset for building a final model, so the problem remains a challenge. The proposed feature selection procedure follows a two-step approach: (a) a sparse modeling-based learning technique is first used to find the best subset of features for each class of a training set; (b) the discovered feature subsets are then fed to a class-specific feature selection scheme in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built, where each classifier is trained on its own feature subset discovered in the previous phase, and a proper decision rule is adopted to compute the ensemble responses. To evaluate the performance of the proposed method, extensive experiments were performed on publicly available datasets, in particular from the computational biology field, where feature selection is indispensable: acute lymphoblastic leukemia and acute myeloid leukemia, human carcinomas, human lung carcinomas, diffuse large B-cell lymphoma, and malignant glioma. SMBA-CSFS is able to identify the most representative features that maximize classification accuracy. With the top 20 and 80 features, SMBA-CSFS exhibits promising performance compared with its competitors from the literature on all considered datasets, especially those with a higher number of features. Experiments show that the proposed approach may outperform state-of-the-art methods when the number of features is high. For this reason, the approach is well suited to the selection and classification of data with a large number of features and classes.
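To make the two-step idea in this abstract concrete, here is a loose sketch of class-specific feature selection followed by an ensemble with one member per class. It is not SMBA-CSFS: the sparse-modeling step is replaced by a simple one-vs-rest mean-difference score, each ensemble member is a nearest-centroid rule, and the decision rule (largest margin wins) is an assumption. All function names are hypothetical.

```python
import numpy as np


def class_specific_features(X, y, k):
    """Step (a), simplified: pick the k highest-scoring features for each class.

    The score is a one-vs-rest standardized mean difference, used here as a
    stand-in for the sparse-modeling step described in the abstract.
    """
    scale = X.std(axis=0) + 1e-12
    subsets = {}
    for c in np.unique(y):
        diff = X[y == c].mean(axis=0) - X[y != c].mean(axis=0)
        subsets[c] = np.argsort(np.abs(diff) / scale)[::-1][:k]
    return subsets


def fit_ensemble(X, y, subsets):
    """Step (b), simplified: one one-vs-rest member per class, each
    restricted to that class's own feature subset."""
    members = {}
    for c, idx in subsets.items():
        centroid_in = X[y == c][:, idx].mean(axis=0)
        centroid_out = X[y != c][:, idx].mean(axis=0)
        members[c] = (idx, centroid_in, centroid_out)
    return members


def predict(members, X):
    """Assumed decision rule: choose the class whose member reports the
    largest margin (distance to 'rest' minus distance to the class)."""
    classes = list(members)
    margins = np.empty((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        idx, centroid_in, centroid_out = members[c]
        Z = X[:, idx]
        d_in = np.linalg.norm(Z - centroid_in, axis=1)
        d_out = np.linalg.norm(Z - centroid_out, axis=1)
        margins[:, j] = d_out - d_in
    return np.array(classes)[margins.argmax(axis=1)]
```

The point of the design is that each class is judged only on the features selected for it, rather than on a single subset shared by all classes.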
Affiliation(s)
- Davide Nardone
  - Dipartimento di Scienze e Tecnologie, Università degli Studi di Napoli “Parthenope”, Naples, Italy
- Angelo Ciaramella
  - Dipartimento di Scienze e Tecnologie, Università degli Studi di Napoli “Parthenope”, Naples, Italy
- Antonino Staiano
  - Dipartimento di Scienze e Tecnologie, Università degli Studi di Napoli “Parthenope”, Naples, Italy
|
5
|
Qu Z, Lau CW, Nguyen QV, Zhou Y, Catchpoole DR. Visual Analytics of Genomic and Cancer Data: A Systematic Review. Cancer Inform 2019; 18:1176935119835546. [PMID: 30890859 PMCID: PMC6416684 DOI: 10.1177/1176935119835546] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/16/2019] [Accepted: 01/29/2019] [Indexed: 12/12/2022] Open
Abstract
Visual analytics and visualisation can leverage the human perceptual system to
interpret and uncover hidden patterns in big data. The advent of next-generation
sequencing technologies has allowed the rapid production of massive amounts of
genomic data and created a corresponding need for new tools and methods for
visualising and interpreting these data. Visualising genomic data requires more
than simply plotting the data: it also involves deciding what message a
particular plot should convey, and the methodologies used to represent the
results must give clinicians, experts, and researchers an easy, clear, and
accurate way to interact with the
data. Genomic data visual analytics is rapidly evolving in parallel with
advances in high-throughput technologies such as artificial intelligence (AI)
and virtual reality (VR). Personalised medicine requires new genomic
visualisation tools, which can efficiently extract knowledge from the genomic
data and speed up expert decisions about the best treatment for an individual
patient’s needs. However, meaningful visual analytics of such large genomic data
remains a serious challenge. This article provides a comprehensive systematic
review and discussion on the tools, methods, and trends for visual analytics of
cancer-related genomic data. We reviewed methods for genomic data visualisation
including traditional approaches such as scatter plots, heatmaps, coordinates,
and networks, as well as emerging technologies using AI and VR. We also
demonstrate the development of genomic data visualisation tools over time and
analyse the evolution of visualising genomic data.
Affiliation(s)
- Zhonglin Qu
  - School of Computing, Engineering and Mathematics, Western Sydney University, Penrith, NSW, Australia
- Chng Wei Lau
  - School of Computing, Engineering and Mathematics, Western Sydney University, Penrith, NSW, Australia
- Quang Vinh Nguyen
  - School of Computing, Engineering and Mathematics, Western Sydney University, Penrith, NSW, Australia; The MARCS Institute, Western Sydney University, Penrith, NSW, Australia
- Yi Zhou
  - School of Computing, Engineering and Mathematics, Western Sydney University, Penrith, NSW, Australia
- Daniel R Catchpoole
  - The Tumour Bank, Children's Cancer Research Unit, Kids Research, The Children's Hospital at Westmead, Westmead, NSW, Australia; Discipline of Paediatrics and Child Health, Faculty of Medicine, The University of Sydney, Sydney, NSW, Australia; Faculty of Information Technology, The University of Technology Sydney, Ultimo, NSW, Australia
|
6
|
|
7
|
Riccio A, Ciaramella A, Giunta G, Galmarini S, Solazzo E, Potempski S. On the systematic reduction of data complexity in multimodel atmospheric dispersion ensemble modeling. J Geophys Res Atmos 2012. [DOI: 10.1029/2011jd016503] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 11/10/2022]
|
8
|
Nahar J, Tickle KS, Shawkat Ali AB. Pattern Discovery from Biological Data. Mach Learn 2012. [DOI: 10.4018/978-1-60960-818-7.ch403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Extracting useful information from structured and unstructured biological data is crucial in the health industry. Examples include medical practitioners’ need to identify breast cancer patients at an early stage, estimate the survival time of a heart disease patient, or recognize uncommon disease characteristics that suddenly appear. There is currently an explosion of biological data available in databases. However, information extraction and true open access to data require time to resolve issues such as ethical clearance. The emergence of novel IT technologies allows health practitioners to carry out comprehensive analyses of medical images, genomes, transcriptomes, and proteomes in health and disease. The information extracted with such technologies may soon dramatically change the pace of medical research and considerably affect the care of patients. The current research will review the existing technologies being used in heart and cancer research. Finally, this research will provide some possible solutions to overcome the limitations of existing technologies. In summary, the primary objective of this research is to investigate how existing modern machine learning techniques (with their strengths and limitations) are being used in the identification of heartbeat-related disease and the early detection of cancer in patients. After an extensive literature review, these objectives were chosen: to develop a new approach to find the association between diseases such as high blood pressure, stroke, and heartbeat; to propose an improved feature selection method to analyze huge image and microarray databases for machine learning algorithms in cancer research; to find an automatic distance function selection method for clustering tasks; to discover the most significant risk factors for specific cancers; and to determine the preventive factors for specific cancers that are aligned with the most significant risk factors.
We therefore propose a research plan to attain these objectives within this chapter. The possible solutions for the above objectives are: new heartbeat identification techniques that show a promising association between heartbeat patterns and diseases; sensitivity-based feature selection methods applied to early cancer patient classification; meta-learning approaches adopted in clustering algorithms to select a distance function automatically; and the Apriori algorithm applied to discover the significant risk and preventive factors for specific cancers. We expect this research to make significant contributions to the medical profession, enabling more accurate diagnosis and better patient care. It will also contribute to other areas such as biomedical modeling, medical image analysis, and early disease warning.
|
9
|
|
10
|
Lisboa P, Vellido A, Tagliaferri R, Napolitano F, Ceccarelli M, Martin-Guerrero J, Biganzoli E. Data Mining in Cancer Research [Application Notes]. IEEE COMPUT INTELL M 2010. [DOI: 10.1109/mci.2009.935311] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 11/07/2022]
|
11
|
López-Rubio E. Multivariate Student-t self-organizing maps. Neural Netw 2009; 22:1432-47. [DOI: 10.1016/j.neunet.2009.05.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 10/03/2008] [Revised: 05/01/2009] [Accepted: 05/01/2009] [Indexed: 11/25/2022]
|