1. Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024] Open
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
2. Islam M, Behura SK. Role of caveolin-1 in metabolic programming of fetal brain. iScience 2023; 26:107710. [PMID: 37720105 PMCID: PMC10500482 DOI: 10.1016/j.isci.2023.107710] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/15/2023] [Revised: 05/10/2023] [Accepted: 08/23/2023] [Indexed: 09/19/2023] Open
Abstract
Mice lacking caveolin-1 (Cav1), a key protein of plasma membrane, exhibit brain aging at an early adult stage. Here, integrative analyses of metabolomics, transcriptomics, epigenetics, and single-cell data were performed to test the hypothesis that metabolic deregulation of fetal brain due to the ablation of Cav1 is linked to brain aging in these mice. The results of this study show that lack of Cav1 caused deregulation in the lipid and amino acid metabolism in the fetal brain, and genes associated with these deregulated metabolites were significantly altered in the brain upon aging. Moreover, ablation of Cav1 deregulated several metabolic genes in specific cell types of the fetal brain and impacted DNA methylation of those genes in coordination with mouse epigenetic clock. The findings of this study suggest that the aging program of brain is confounded by metabolic abnormalities in the fetal stage due to the absence of Cav1.
Affiliation(s)
- Maliha Islam
  - Division of Animal Sciences, 920 East Campus Drive, University of Missouri, Columbia, MO 65211, USA
- Susanta K. Behura
  - Division of Animal Sciences, 920 East Campus Drive, University of Missouri, Columbia, MO 65211, USA
  - MU Institute for Data Science and Informatics, University of Missouri, Columbia, MO, USA
  - Interdisciplinary Reproduction and Health Group, University of Missouri, Columbia, MO, USA
  - Interdisciplinary Neuroscience Program, University of Missouri, Columbia, MO, USA
3. Epigenetic regulation of fetal brain development in pig. Gene 2022; 844:146823. [PMID: 35988784 DOI: 10.1016/j.gene.2022.146823] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/22/2022] [Revised: 07/27/2022] [Accepted: 08/15/2022] [Indexed: 02/01/2023]
Abstract
How fetal brain development is regulated at the molecular level is not well understood. Due to ethical challenges associated with research on the human fetus, large animals, particularly pigs, are increasingly used to study development and disorders of the fetal brain. The pig fetal brain grows rapidly during the last ∼50 days before birth, starting around day 60 (d60) of pig gestation. But what regulates the onset of accelerated growth of the brain is unknown. The current study tests the hypothesis that epigenetic alteration around d60 is involved in the onset of rapid growth of the fetal brain of the pig. To test this hypothesis, DNA methylation changes in the fetal brain were assessed in a genome-wide manner by Enzymatic Methyl-seq (EM-seq) during two gestational periods (GP): d45 vs. d60 (GP1) and d60 vs. d90 (GP2). The cytosine-guanine (CpG) methylation data were analyzed in an integrative manner with the RNA-seq data generated from the same brain samples in our earlier study. A neural network based modeling approach was implemented to learn changes in methylation patterns of the differentially expressed genes, and then predict methylations of the brain in a genome-wide manner during rapid growth. This approach identified specific methylations that changed in a mutually informative manner during rapid growth of the fetal brain. These methylations were significantly overrepresented in specific genic as well as intergenic features including CpG islands, introns, and untranslated regions. In addition, sex-biased methylations of known single nucleotide polymorphic sites were also identified in the fetal brain during rapid growth.
4. Fan W, Bouguila N, Du JX, Liu X. Axially Symmetric Data Clustering Through Dirichlet Process Mixture Models of Watson Distributions. IEEE Transactions on Neural Networks and Learning Systems 2019; 30:1683-1694. [PMID: 30369452 DOI: 10.1109/tnnls.2018.2872986] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 06/08/2023]
Abstract
This paper proposes a Bayesian nonparametric framework for clustering axially symmetric data. Our approach is based on a Dirichlet process mixture model with Watson distributions, which can also be considered as the infinite Watson mixture model. First, we extend the finite Watson mixture model into its infinite counterpart based on the framework of the truncated Dirichlet process mixture model with a stick-breaking representation. Second, we propose a coordinate ascent mean-field variational inference algorithm that can effectively learn the parameters of our model with closed-form solutions. Third, to cope with massive data sets, we develop a stochastic variational inference algorithm to learn the proposed model through the method of stochastic gradient ascent. Finally, the proposed nonparametric Bayesian model is evaluated through simulated axially symmetric data sets and a real-world application, namely, gene expression data clustering.
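The stick-breaking representation mentioned in this abstract can be illustrated with a short sketch. It draws mixture weights for a truncated Dirichlet process by repeatedly breaking off a Beta(1, alpha) fraction of the remaining stick. The function name, the `alpha` and `truncation` parameters, and the seed are illustrative assumptions; this is not the authors' code and it does not cover the Watson components or the variational inference.

```python
import random


def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha and truncation are illustrative parameters (assumptions),
    not values taken from the paper.
    """
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for k in range(truncation):
        # Fixing the last fraction at 1 makes the truncated weights sum to 1.
        v = 1.0 if k == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


weights = stick_breaking_weights(alpha=2.0, truncation=10, rng=random.Random(0))
print(sum(weights))  # close to 1.0 up to floating-point error
```

Smaller values of `alpha` tend to concentrate the mass on the first few components, which is how such a model can leave most of the truncated components effectively unused.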
5. Nepomuceno JA, Troncoso A, Nepomuceno-Chamorro IA, Aguilar-Ruiz JS. Pairwise gene GO-based measures for biclustering of high-dimensional expression data. BioData Min 2018; 11:4. [PMID: 29610579 PMCID: PMC5872503 DOI: 10.1186/s13040-018-0165-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 10/16/2017] [Accepted: 03/01/2018] [Indexed: 11/15/2022] Open
Abstract
Background Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of groups of functionally coherent genes. Moreover, a distance among genes can be defined according to the information stored about them in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function integrating GO information is studied in this paper. This merit function uses a term that incorporates the information through a GO measure. Results The effect of two different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. Conclusions It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.
Affiliation(s)
- Juan A Nepomuceno
  - Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, Seville, 41012 Spain
- Alicia Troncoso
  - Área de Informática, Universidad Pablo de Olavide, Ctra. Utrera km. 1, Seville, 41013 Spain
- Isabel A Nepomuceno-Chamorro
  - Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, Seville, 41012 Spain
- Jesús S Aguilar-Ruiz
  - Área de Informática, Universidad Pablo de Olavide, Ctra. Utrera km. 1, Seville, 41013 Spain
6. Biclustering by sparse canonical correlation analysis. Quantitative Biology 2018. [DOI: 10.1007/s40484-017-0127-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
7. Bhattacharya A, Cui Y. A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules. Sci Rep 2017. [PMID: 28646174 PMCID: PMC5482832 DOI: 10.1038/s41598-017-04070-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 02/02/2023] Open
Abstract
In the analysis of large-scale gene expression data, it is important to identify groups of genes with common expression patterns under certain conditions. Many biclustering algorithms have been developed to address this problem. However, comprehensive discovery of functionally coherent biclusters from large datasets remains a challenging problem. Here we propose a GPU-accelerated biclustering algorithm, based on searching for the largest Condition-dependent Correlation Subgroups (CCS) for each gene in the gene expression dataset. We compared CCS with thirteen widely used biclustering algorithms. CCS consistently outperformed all the thirteen biclustering algorithms on both synthetic and real gene expression datasets. As a correlation-based biclustering method, CCS can also be used to find condition-dependent coexpression network modules. We implemented the CCS algorithm using C and implemented the parallelized CCS algorithm using CUDA C for GPU computing. The source code of CCS is available from https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS.
Affiliation(s)
- Anindya Bhattacharya
  - Department of Microbiology, Immunology and Biochemistry, Memphis, TN, 38163, USA
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
  - Department of Computer Science and Engineering, University of California, San Diego, CA, 92093, USA
- Yan Cui
  - Department of Microbiology, Immunology and Biochemistry, Memphis, TN, 38163, USA
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
8. Nepomuceno JA, Troncoso A, Aguilar-Ruiz JS. Scatter search-based identification of local patterns with positive and negative correlations in gene expression data. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.06.019] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 01/08/2023]
9. Nepomuceno JA, Troncoso A, Nepomuceno-Chamorro IA, Aguilar-Ruiz JS. Integrating biological knowledge based on functional annotations for biclustering of gene expression data. Computer Methods and Programs in Biomedicine 2015; 119:163-80. [PMID: 25843807 DOI: 10.1016/j.cmpb.2015.02.010] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 07/22/2014] [Revised: 02/17/2015] [Accepted: 02/27/2015] [Indexed: 05/06/2023]
Abstract
Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques have recently been improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches for patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the only input of the algorithm is a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by adding it to the fitness function to be optimized in the scatter search scheme. The measure FracGO is based on biological enrichment and SimNTO is based on the overlap among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm for two datasets and show that the algorithm performs better when biological knowledge is integrated. Moreover, an analysis and comparison of the two biological measures is presented, and it is concluded that the differences depend on both the data source and how the annotation file has been built in the case where GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmarks, and an analysis of the overlap among biclusters reveals that the biclusters obtained overlap little.
The proposed methodology is a general-purpose algorithm which allows the integration of biological information from several sources and can be extended to other biclustering algorithms based on the optimization of a merit function.
Affiliation(s)
- Juan A Nepomuceno
  - Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, 41012 Seville, Spain
- Alicia Troncoso
  - Department of Computer Engineering, Pablo de Olavide University, Ctra. Utrera km. 1, 41013 Seville, Spain
- Isabel A Nepomuceno-Chamorro
  - Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, 41012 Seville, Spain
- Jesús S Aguilar-Ruiz
  - Department of Computer Engineering, Pablo de Olavide University, Ctra. Utrera km. 1, 41013 Seville, Spain
10. Deveci M, Küçüktunç O, Eren K, Bozdağ D, Kaya K, Çatalyürek ÜV. Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering. Methods Mol Biol 2015; 1375:55-74. [PMID: 26626937 DOI: 10.1007/7651_2015_246] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/21/2023]
Abstract
Rapid development and increasing popularity of gene expression microarrays have resulted in a number of studies on the discovery of co-regulated genes. One important way of discovering such co-regulations is the query-based search since gene co-expressions may indicate a shared role in a biological process. Although there exist promising query-driven search methods adapting clustering, they fail to capture many genes that function in the same biological pathway because microarray datasets are fraught with spurious samples or samples of diverse origin, or the pathways might be regulated under only a subset of samples. On the other hand, a class of clustering algorithms known as biclustering algorithms which simultaneously cluster both the items and their features are useful while analyzing gene expression data, or any data in which items are related in only a subset of their samples. This means that genes need not be related in all samples to be clustered together. Because many genes only interact under specific circumstances, biclustering may recover the relationships that traditional clustering algorithms can easily miss. In this chapter, we briefly summarize the literature using biclustering for querying co-regulated genes. Then we present a novel biclustering approach and evaluate its performance by a thorough experimental analysis.
Affiliation(s)
- Mehmet Deveci
  - Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
- Onur Küçüktunç
  - Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
- Kemal Eren
  - Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
- Doruk Bozdağ
  - Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Kamer Kaya
  - Computer Science and Engineering, Sabancı University, Istanbul, Turkey
- Ümit V Çatalyürek
  - Biomedical Informatics, Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
11. Horta D, Campello RJGB. Similarity Measures for Comparing Biclusterings. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:942-954. [PMID: 26356865 DOI: 10.1109/tcbb.2014.2325016] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 06/05/2023]
Abstract
The comparison of ordinary partitions of a set of objects is well established in the clustering literature, which comprehends several studies on the analysis of the properties of similarity measures for comparing partitions. However, similarity measures for clusterings are not readily applicable to biclusterings, since each bicluster is a tuple of two sets (of rows and columns), whereas a cluster is only a single set (of rows). Some biclustering similarity measures have been defined as minor contributions in papers which primarily report on proposals and evaluation of biclustering algorithms or comparative analyses of biclustering algorithms. The consequence is that some desirable properties of such measures have been overlooked in the literature. We review 14 biclustering similarity measures. We define eight desirable properties of a biclustering measure, discuss their importance, and prove which properties each of the reviewed measures has. We show examples drawn and inspired from important studies in which several biclustering measures convey misleading evaluations due to the absence of one or more of the discussed properties. We also advocate the use of a more general comparison approach that is based on the idea of transforming the original problem of comparing biclusterings into an equivalent problem of comparing clustering partitions with overlapping clusters.
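To make concrete why a bicluster, as a tuple of row and column sets, needs its own similarity measure, the sketch below compares two biclusters by the matrix cells they cover and contrasts that with a Jaccard index computed on rows alone. The cell-level Jaccard index is used here only as an illustration; whether it matches any of the 14 measures reviewed in the paper is not stated in the abstract, and the function names and toy biclusters are assumptions.

```python
def row_jaccard(a, b):
    """Jaccard index on row sets only, as one would compare ordinary clusters."""
    rows_a, rows_b = a[0], b[0]
    union = rows_a | rows_b
    return len(rows_a & rows_b) / len(union) if union else 0.0


def cell_jaccard(a, b):
    """Jaccard index on the matrix cells (row, column pairs) covered by each bicluster."""
    rows_a, cols_a = a
    rows_b, cols_b = b
    shared = len(rows_a & rows_b) * len(cols_a & cols_b)
    total = len(rows_a) * len(cols_a) + len(rows_b) * len(cols_b) - shared
    return shared / total if total else 0.0


# Two hypothetical biclusters, each a (rows, columns) tuple of index sets.
b1 = ({0, 1, 2}, {0, 1})
b2 = ({1, 2, 3}, {1, 2})

print(row_jaccard(b1, b2))   # 0.5
print(cell_jaccard(b1, b2))  # 0.2
```

The row-only comparison reports a much higher similarity than the cell-level one because it ignores that the two biclusters share only one column.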
12. Pei Y, Gao Q, Li J, Zhao X. Identifying local co-regulation relationships in gene expression data. J Theor Biol 2014; 360:200-207. [PMID: 25042175 DOI: 10.1016/j.jtbi.2014.06.032] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 12/15/2013] [Accepted: 06/26/2014] [Indexed: 11/24/2022]
Abstract
Identifying interesting relationships between pairs of genes, present over some of the experimental conditions in a gene expression data set, is useful for discovering novel functional gene interactions. In this paper, we introduce a new method for Identifying Local Co-regulation Relationships (IdLCR). These local relationships describe the behaviors of pairwise genes, which are either up- or down-regulated throughout the identified condition subset. IdLCR first detects the pairwise gene-gene relationships taking functional forms and the condition subsets by using a regression spline model. Then it measures the relationships using a penalized Pearson correlation and ranks the corresponding gene pairs by their scores. In this way, relationships without clear biological interpretations can be filtered out and the local co-regulation relationships can be obtained. In the simulation data sets, ten different functional relationships are embedded. Applying IdLCR to these data sets, the results show its ability to identify functional relationships and the condition subsets. For micro-array and RNA-seq gene expression data, IdLCR can identify novel biological relationships which are different from those uncovered by IFGR and MINE.
Affiliation(s)
- Yonggang Pei
  - College of Mathematics and Information Science, Henan Normal University, Xinxiang 453007, China
- Qinghui Gao
  - College of Mathematics and Information Science, Henan Normal University, Xinxiang 453007, China
- Juntao Li
  - College of Mathematics and Information Science, Henan Normal University, Xinxiang 453007, China
- Xiting Zhao
  - College of Life Science, Henan Normal University, Xinxiang 453007, China
13. Flores JL, Inza I, Larrañaga P, Calvo B. A new measure for gene expression biclustering based on non-parametric correlation. Computer Methods and Programs in Biomedicine 2013; 112:367-397. [PMID: 24079964 DOI: 10.1016/j.cmpb.2013.07.025] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 09/16/2012] [Revised: 06/14/2013] [Accepted: 07/26/2013] [Indexed: 06/02/2023]
Abstract
BACKGROUND One of the emerging techniques for analyzing DNA microarray data is biclustering, the search for subsets of genes and conditions that are coherently expressed. These subgroups provide clues about the main biological processes. Until now, different approaches to this problem have been proposed. Most of them use the mean squared residue as a quality measure, but relevant and interesting patterns, such as shifting or scaling patterns, cannot be detected. Furthermore, recent papers show that there exist new coherence patterns involved in different kinds of cancer and tumors, such as inverse relationships between genes, which cannot be captured. RESULTS The proposed measure, called Spearman's biclustering measure (SBM), estimates the quality of a bicluster based on the non-linear correlation among genes and conditions simultaneously. The search for biclusters is performed using an evolutionary technique called estimation of distribution algorithms, which uses the SBM measure as its fitness function. This approach has been examined from different points of view using artificial and real microarrays. The assessment process involved quality indexes, a set of reference bicluster patterns including new patterns, and a set of statistical tests. Performance has also been examined using real microarrays and compared with different algorithmic approaches such as Bimax, CC, OPSM, Plaid and xMotifs. CONCLUSIONS SBM shows several advantages, such as the ability to recognize more complex coherence patterns such as shifting, scaling and inversion, and the capability to selectively marginalize genes and conditions depending on the statistical significance.
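The contrast this abstract draws between the mean squared residue and a rank-based correlation can be sketched on a toy example. The code below computes the mean squared residue of a small scaling pattern and the Spearman rank correlation between rows. It is only an illustration of the underlying idea: it is not the SBM measure itself, the toy matrix is an assumption, and the rank helper assumes there are no tied values.

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a submatrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


def ranks(values):
    """Ranks starting at 1; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for two equal-length sequences without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scaling pattern: every row is a multiple of the first row.
scaling = [[1, 2, 4], [2, 4, 8], [5, 10, 20]]

print(mean_squared_residue(scaling))        # clearly above zero
print(spearman(scaling[0], scaling[2]))     # 1.0
print(spearman([1, 2, 4], [4, 2, 1]))       # -1.0 for an inverted profile
```

A residue well above zero would penalize this submatrix even though its rows are perfectly coherent, whereas the rank correlation is 1 for the scaled rows and -1 for an inverted profile.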
Affiliation(s)
- Jose L Flores
  - Intelligent Systems Group, Department of Computer Sciences and Artificial Intelligence, University of the Basque Country, P.O. Box 649, 20080 Donostia - San Sebastian, Spain
14. Nosova E, Napolitano F, Amato R, Cocozza S, Miele G, Raiconi G, Tagliaferri R. An improved combinatorial biclustering algorithm. Neural Comput Appl 2013. [DOI: 10.1007/s00521-012-0902-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
15. Yun T, Yi GS. Biclustering for the comprehensive search of correlated gene expression patterns using clustered seed expansion. BMC Genomics 2013; 14:144. [PMID: 23496895 PMCID: PMC3618306 DOI: 10.1186/1471-2164-14-144] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 10/27/2012] [Accepted: 02/21/2013] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND In a functional analysis of gene expression data, biclustering method can give crucial information by showing correlated gene expression patterns under a subset of conditions. However, conventional biclustering algorithms still have some limitations to show comprehensive and stable outputs. RESULTS We propose a novel biclustering approach called "BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC)" to find comprehensive sets of correlated expression patterns in biclusters using clustered seeds and their expansion with correlation of gene expression. BICLIC outperformed competing biclustering algorithms by completely recovering implanted biclusters in simulated datasets with various types of correlated patterns: shifting, scaling, and shifting-scaling. Furthermore, in a real yeast microarray dataset and a lung cancer microarray dataset, BICLIC found more comprehensive sets of biclusters that are significantly enriched to more diverse sets of biological terms than those of other competing biclustering algorithms. CONCLUSIONS BICLIC provides significant benefits in finding comprehensive sets of correlated patterns and their functional implications from a gene expression dataset.
Affiliation(s)
- Taegyun Yun
  - Department of Information and Communications Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 305-701, Republic of Korea
- Gwan-Su Yi
  - Department of Information and Communications Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 305-701, Republic of Korea
  - Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 305-701, Republic of Korea
16. Gao Q, Ho C, Jia Y, Li JJ, Huang H. Biclustering of linear patterns in gene expression data. J Comput Biol 2012; 19:619-31. [PMID: 22697238 DOI: 10.1089/cmb.2012.0032] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
Abstract
Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and the genetic algorithm in optimization, incorporating resampling for more robust discovery. When applied to both real and simulation datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. Supplementary Material is available online (at www.liebertonline.com/cmb).
Affiliation(s)
- Qinghui Gao
  - Seventh Research Division and Department of Systems and Control, Beihang University, Beijing, China
17. Sun J, Garibaldi JM, Kenobi K. Robust Bayesian clustering for replicated gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1504-1514. [PMID: 22641714 DOI: 10.1109/tcbb.2012.85] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/01/2023]
Abstract
Experimental scientific data sets, especially biology data, usually contain replicated measurements. The replicated measurements for the same object are correlated, and this correlation must be carefully dealt with in scientific analysis. In this paper, we propose a robust Bayesian mixture model for clustering data sets with replicated measurements. The model aims not only to accurately cluster the data points taking the replicated measurements into consideration, but also to find the outliers (i.e., scattered objects) which are possibly required to be studied further. A tree-structured variational Bayes (VB) algorithm is developed to carry out model fitting. Experimental studies showed that our model compares favorably with the infinite Gaussian mixture model, while maintaining computational simplicity. We demonstrate the benefits of including the replicated measurements in the model, in terms of improved outlier detection rates in varying measurement uncertainty conditions. Finally, we apply the approach to clustering biological transcriptomics mRNA expression data sets with replicated measurements.
Affiliation(s)
- Jianyong Sun
  - Centre for Plant Integrative Biology (CPIB), School of Bioscience, The University of Nottingham, Sutton Bonington
18. Bhattacharya A, De RK. A novel noise handling method to improve clustering of gene expression patterns. BMC Bioinformatics 2011. [PMCID: PMC3194212 DOI: 10.1186/1471-2105-12-s7-a3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open
19
|
Rodriguez-Baena DS, Perez-Pulido AJ, Aguilar-Ruiz JS. A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics 2011; 27:2738-45. [PMID: 21824973 DOI: 10.1093/bioinformatics/btr464] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022]
Abstract
MOTIVATION Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. RESULTS A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. AVAILABILITY The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html CONTACT dsrodbae@upo.es SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
20
|
Sill M, Kaiser S, Benner A, Kopp-Schneider A. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics 2011; 27:2089-97. [PMID: 21636597 DOI: 10.1093/bioinformatics/btr322] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION Over the past decade, several biclustering approaches have been published in the field of gene expression data analysis. Despite the huge diversity in the mathematical concepts underlying the different biclustering methods, many of them can be related to the singular value decomposition (SVD). Recently, a sparse SVD approach (SSVD) has been proposed to reveal biclusters in gene expression data. In this article, we propose to incorporate stability selection to improve this method. Stability selection is a subsampling-based variable selection method that allows Type I error rates to be controlled. The S4VD algorithm proposed here incorporates this subsampling approach to find stable biclusters and to estimate the selection probabilities of genes and samples belonging to the biclusters. RESULTS So far, the S4VD method is the first biclustering approach that takes into account the stability of clusters under perturbations of the data. Application of the S4VD algorithm to a lung cancer microarray dataset revealed biclusters that correspond to coregulated genes associated with cancer subtypes. Marker genes for different lung cancer subtypes showed high selection probabilities of belonging to the corresponding biclusters. Moreover, the genes associated with the biclusters belong to significantly enriched cancer-related Gene Ontology categories. In a simulation study, the S4VD algorithm outperformed the SSVD algorithm and two other SVD-related biclustering methods in recovering artificial biclusters and in being robust to noisy data. AVAILABILITY R code for the S4VD algorithm, along with documentation, can be found at http://s4vd.r-forge.r-project.org/.
Affiliation(s)
- Martin Sill
- Division of Biostatistics, DKFZ, 69120 Heidelberg, Germany.
|
21
|
Nepomuceno JA, Troncoso A, Aguilar-Ruiz JS. Biclustering of gene expression data by correlation-based scatter search. BioData Min 2011; 4:3. [PMID: 21261986 PMCID: PMC3037342 DOI: 10.1186/1756-0381-4-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 06/03/2010] [Accepted: 01/24/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The analysis of data generated by microarray technology is very useful for understanding how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes that are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as the merit function, but patterns that are interesting and relevant from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. However, it is important to discover this type of pattern, since genes commonly present similar behavior even though their expression levels vary in different ranges or magnitudes. METHODS Scatter Search is an evolutionary technique based on the evolution of a small set of solutions that are chosen according to quality and diversity criteria. This paper presents a Scatter Search for finding biclusters in gene expression data. In this algorithm, the proposed fitness function is based on the linear correlation among genes in order to detect shifting and scaling patterns, and an improvement method is included in order to select only positively correlated genes. RESULTS The proposed algorithm has been tested on three real data sets (the Yeast Cell Cycle dataset, the human B-cell lymphoma dataset, and the Yeast Stress dataset), finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
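The contrast drawn in this abstract between the Mean Squared Residue and a correlation-based measure can be illustrated with a small sketch. It is not the authors' fitness function: the function names, the toy expression values, and the use of the average pairwise Pearson correlation are assumptions made only for illustration. The Mean Squared Residue is computed in its usual form, as the mean of the squared residues after subtracting row and column means and adding back the overall mean.

```python
import numpy as np


def mean_squared_residue(block):
    """Mean squared residue of a bicluster given as a genes x conditions array."""
    row_means = block.mean(axis=1, keepdims=True)
    col_means = block.mean(axis=0, keepdims=True)
    residue = block - row_means - col_means + block.mean()
    return float((residue ** 2).mean())


def mean_row_correlation(block):
    """Average Pearson correlation over all pairs of rows (hypothetical merit measure)."""
    corr = np.corrcoef(block)
    upper = np.triu_indices_from(corr, k=1)
    return float(corr[upper].mean())


# Toy expression profile over four conditions (made-up values).
base = np.array([1.0, 4.0, 2.0, 7.0])

# Shifting pattern: each gene is the base profile plus a constant.
shifting = np.vstack([base + s for s in (0.0, 3.0, 10.0)])

# Scaling pattern: each gene is the base profile times a constant.
scaling = np.vstack([base * s for s in (1.0, 2.0, 5.0)])

for name, block in (("shifting", shifting), ("scaling", scaling)):
    print(name, mean_squared_residue(block), mean_row_correlation(block))
```

On these toy blocks the residue is essentially zero for the shifting pattern but clearly positive for the scaling pattern, while the row correlation is essentially 1 in both cases, which is the behavior the abstract attributes to the two kinds of measure.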
Affiliation(s)
- Juan A Nepomuceno
- Dpt. Lenguajes y Sistemas Informáticos, ETSII, University of Seville, Avd. Reina Mercedes s/n, 41012, Seville, Spain
- Alicia Troncoso
- Department of Computer Science, School of Engineering, Pablo de Olavide University, Ctra. Utrera km. 1, 41013, Seville, Spain
- Jesús S Aguilar-Ruiz
- Department of Computer Science, School of Engineering, Pablo de Olavide University, Ctra. Utrera km. 1, 41013, Seville, Spain
|