1
|
Gu Q, Veselkov K. Bi-clustering of metabolic data using matrix factorization tools. Methods 2018; 151:12-20. [PMID: 29438828 PMCID: PMC6297113 DOI: 10.1016/j.ymeth.2018.02.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2018] [Revised: 02/04/2018] [Accepted: 02/06/2018] [Indexed: 01/08/2023] Open
Abstract
We propose a positive matrix factorization bi-clustering strategy for metabolic data. The approach automatically determines the number and composition of bi-clusters. We demonstrate its superior performance compared to other techniques.
Metabolic phenotyping technologies based on Nuclear Magnetic Spectroscopy (NMR) and Mass Spectrometry (MS) generate vast amounts of unrefined data from biological samples. Clustering strategies are frequently employed to provide insight into patterns of relationships between samples and metabolites. Here, we propose the use of a non-negative matrix factorization driven bi-clustering strategy for metabolic phenotyping data in order to discover subsets of interrelated metabolites that exhibit similar behaviour across subsets of samples. The proposed strategy incorporates bi-cross validation and statistical segmentation techniques to automatically determine the number and structure of bi-clusters. This alternative approach is in contrast to the widely used conventional clustering approaches that incorporate all molecular peaks for clustering in metabolic studies and require a priori specification of the number of clusters. We perform the comparative analysis of the proposed strategy with other bi-clustering approaches, which were developed in the context of genomics and transcriptomics research. We demonstrate the superior performance of the proposed bi-clustering strategy on both simulated (NMR) and real (MS) bacterial metabolic data.
Collapse
Affiliation(s)
- Quan Gu
- MRC-University of Glasgow Centre for Virus Research, University of Glasgow, Garscube Estate, Glasgow G61 1QH, UK
| | - Kirill Veselkov
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, Exhibition Road, South Kensington, London SW7 2AZ, UK.
| |
Collapse
|
2
|
|
3
|
Ye M, Wang Z, Wang Y, Wu R. A multi-Poisson dynamic mixture model to cluster developmental patterns of gene expression by RNA-seq. Brief Bioinform 2014; 16:205-15. [PMID: 24817567 DOI: 10.1093/bib/bbu013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open
Abstract
Dynamic changes of gene expression reflect an intrinsic mechanism of how an organism responds to developmental and environmental signals. With the increasing availability of expression data across a time-space scale by RNA-seq, the classification of genes as per their biological function using RNA-seq data has become one of the most significant challenges in contemporary biology. Here we develop a clustering mixture model to discover distinct groups of genes expressed during a period of organ development. By integrating the density function of multivariate Poisson distribution, the model accommodates the discrete property of read counts characteristic of RNA-seq data. The temporal dependence of gene expression is modeled by the first-order autoregressive process. The model is implemented with the Expectation-Maximization algorithm and model selection to determine the optimal number of gene clusters and obtain the estimates of Poisson parameters that describe the pattern of time-dependent expression of genes from each cluster. The model has been demonstrated by analyzing a real data from an experiment aimed to link the pattern of gene expression to catkin development in white poplar. The usefulness of the model has been validated through computer simulation. The model provides a valuable tool for clustering RNA-seq data, facilitating our global view of expression dynamics and understanding of gene regulation mechanisms.
Collapse
|
4
|
Abstract
In this chapter, different methods and applications of biclustering algorithms to DNA microarray data analysis that have been developed in recent years are discussed and compared. Identification of biological significant clusters of genes from microarray experimental data is a very daunting task that emerged, especially with the development of high throughput technologies. Various computational and evaluation methods based on diverse principles were introduced to identify new similarities among genes. Mathematical aspects of the models are highlighted, and applications to solve biological problems are discussed.
Collapse
|
5
|
Xu R, Wunsch II DC. BARTMAP: A viable structure for biclustering. Neural Netw 2011; 24:709-16. [DOI: 10.1016/j.neunet.2011.03.020] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2010] [Revised: 02/14/2011] [Accepted: 03/16/2011] [Indexed: 10/18/2022]
|
6
|
Bellay J, Atluri G, Sing TL, Toufighi K, Costanzo M, Ribeiro PSM, Pandey G, Baller J, VanderSluis B, Michaut M, Han S, Kim P, Brown GW, Andrews BJ, Boone C, Kumar V, Myers CL. Putting genetic interactions in context through a global modular decomposition. Genome Res 2011; 21:1375-87. [PMID: 21715556 DOI: 10.1101/gr.117176.110] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Genetic interactions provide a powerful perspective into gene function, but our knowledge of the specific mechanisms that give rise to these interactions is still relatively limited. The availability of a global genetic interaction map in Saccharomyces cerevisiae, covering ∼30% of all possible double mutant combinations, provides an unprecedented opportunity for an unbiased assessment of the native structure within genetic interaction networks and how it relates to gene function and modular organization. Toward this end, we developed a data mining approach to exhaustively discover all block structures within this network, which allowed for its complete modular decomposition. The resulting modular structures revealed the importance of the context of individual genetic interactions in their interpretation and revealed distinct trends among genetic interaction hubs as well as insights into the evolution of duplicate genes. Block membership also revealed a surprising degree of multifunctionality across the yeast genome and enabled a novel association of VIP1 and IPK1 with DNA replication and repair, which is supported by experimental evidence. Our modular decomposition also provided a basis for testing the between-pathway model of negative genetic interactions and within-pathway model of positive genetic interactions. While we find that most modular structures involving negative genetic interactions fit the between-pathway model, we found that current models for positive genetic interactions fail to explain 80% of the modular structures detected. We also find differences between the modular structures of essential and nonessential genes.
Collapse
Affiliation(s)
- Jeremy Bellay
- Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
7
|
Saccenti E, Westerhuis JA, Smilde AK, van der Werf MJ, Hageman JA, Hendriks MMWB. Simplivariate models: uncovering the underlying biology in functional genomics data. PLoS One 2011; 6:e20747. [PMID: 21698241 PMCID: PMC3116836 DOI: 10.1371/journal.pone.0020747] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2010] [Accepted: 05/12/2011] [Indexed: 12/19/2022] Open
Abstract
One of the first steps in analyzing high-dimensional functional genomics data is an exploratory analysis of such data. Cluster Analysis and Principal Component Analysis are then usually the method of choice. Despite their versatility they also have a severe drawback: they do not always generate simple and interpretable solutions. On the basis of the observation that functional genomics data often contain both informative and non-informative variation, we propose a method that finds sets of variables containing informative variation. This informative variation is subsequently expressed in easily interpretable simplivariate components. We present a new implementation of the recently introduced simplivariate models. In this implementation, the informative variation is described by multiplicative models that can adequately represent the relations between functional genomics data. Both a simulated and two real-life metabolomics data sets show good performance of the method.
Collapse
Affiliation(s)
- Edoardo Saccenti
- Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands.
| | | | | | | | | | | |
Collapse
|
8
|
Sill M, Kaiser S, Benner A, Kopp-Schneider A. Robust biclustering by sparse singular value decomposition incorporating stability selection. ACTA ACUST UNITED AC 2011; 27:2089-97. [PMID: 21636597 DOI: 10.1093/bioinformatics/btr322] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Over the past decade, several biclustering approaches have been published in the field of gene expression data analysis. Despite of huge diversity regarding the mathematical concepts of the different biclustering methods, many of them can be related to the singular value decomposition (SVD). Recently, a sparse SVD approach (SSVD) has been proposed to reveal biclusters in gene expression data. In this article, we propose to incorporate stability selection to improve this method. Stability selection is a subsampling-based variable selection that allows to control Type I error rates. The here proposed S4VD algorithm incorporates this subsampling approach to find stable biclusters, and to estimate the selection probabilities of genes and samples to belong to the biclusters. RESULTS So far, the S4VD method is the first biclustering approach that takes the cluster stability regarding perturbations of the data into account. Application of the S4VD algorithm to a lung cancer microarray dataset revealed biclusters that correspond to coregulated genes associated with cancer subtypes. Marker genes for different lung cancer subtypes showed high selection probabilities to belong to the corresponding biclusters. Moreover, the genes associated with the biclusters belong to significantly enriched cancer-related Gene Ontology categories. In a simulation study, the S4VD algorithm outperformed the SSVD algorithm and two other SVD-related biclustering methods in recovering artificial biclusters and in being robust to noisy data. AVAILABILITY R-Code of the S4VD algorithm as well as a documentation can be found at http://s4vd.r-forge.r-project.org/.
Collapse
Affiliation(s)
- Martin Sill
- Division of Biostatistics, DKFZ, 69120 Heidelberg, Germany.
| | | | | | | |
Collapse
|
9
|
Izenman AJ, Harris PW, Mennis J, Jupin J, Obradovic Z. Local spatial biclustering and prediction of urban juvenile delinquency and recidivism. Stat Anal Data Min 2011. [DOI: 10.1002/sam.10123] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
10
|
Kompass KS, Witte JS. Co-regulatory expression quantitative trait loci mapping: method and application to endometrial cancer. BMC Med Genomics 2011; 4:6. [PMID: 21226949 PMCID: PMC3032645 DOI: 10.1186/1755-8794-4-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2010] [Accepted: 01/12/2011] [Indexed: 01/16/2023] Open
Abstract
Background Expression quantitative trait loci (eQTL) studies have helped identify the genetic determinants of gene expression. Understanding the potential interacting mechanisms underlying such findings, however, is challenging. Methods We describe a method to identify the trans-acting drivers of multiple gene co-expression, which reflects the action of regulatory molecules. This method-termed co-regulatory expression quantitative trait locus (creQTL) mapping-allows for evaluation of a more focused set of phenotypes within a clear biological context than conventional eQTL mapping. Results Applying this method to a study of endometrial cancer revealed regulatory mechanisms supported by the literature: a creQTL between a locus upstream of STARD13/DLC2 and a group of seven IFNβ-induced genes. This suggests that the Rho-GTPase encoded by STARD13 regulates IFNβ-induced genes and the DNA damage response. Conclusions Because of the importance of IFNβ in cancer, our results suggest that creQTL may provide a finer picture of gene regulation and may reveal additional molecular targets for intervention. An open source R implementation of the method is available at http://sites.google.com/site/kenkompass/.
Collapse
Affiliation(s)
- Kenneth S Kompass
- Department of Epidemiology and Biostatistics, Institute for Human Genetics, University of California, San Francisco, USA
| | | |
Collapse
|
11
|
DiMaggio PA, Subramani A, Judson RS, Floudas CA. A novel framework for predicting in vivo toxicities from in vitro data using optimal methods for dense and sparse matrix reordering and logistic regression. Toxicol Sci 2010; 118:251-65. [PMID: 20702588 DOI: 10.1093/toxsci/kfq233] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
In this work, we combine the strengths of mixed-integer linear optimization (MILP) and logistic regression for predicting the in vivo toxicity of chemicals using only their measured in vitro assay data. The proposed approach utilizes a biclustering method based on iterative optimal reordering (DiMaggio, P. A., McAllister, S. R., Floudas, C. A., Feng, X. J., Rabinowitz, J. D., and Rabitz, H. A. (2008). Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 9, 458-474.; DiMaggio, P. A., McAllister, S. R., Floudas, C. A., Feng, X. J., Rabinowitz, J. D., and Rabitz, H. A. (2010b). A network flow model for biclustering via optimal re-ordering of data matrices. J. Global. Optim. 47, 343-354.) to identify biclusters corresponding to subsets of chemicals that have similar responses over distinct subsets of the in vitro assays. The biclustering of the in vitro assays is shown to result in significant clustering based on assay target (e.g., cytochrome P450 [CYP] and nuclear receptors) and type (e.g., downregulated BioMAP and biochemical high-throughput screening protein kinase activity assays). An optimal method based on mixed-integer linear optimization for reordering sparse data matrices (DiMaggio, P. A., McAllister, S. R., Floudas, C. A., Feng, X. J., Li, G. Y., Rabinowitz, J. D., and Rabitz, H. A. (2010a). Enhancing molecular discovery using descriptor-free rearrangement clustering techniques for sparse data sets. AIChE J. 56, 405-418.; McAllister, S. R., DiMaggio, P. A., and Floudas, C. A. (2009). Mathematical modeling and efficient optimization methods for the distance-dependent rearrangement clustering problem. J. Global. Optim. 45, 111-129) is then applied to the in vivo data set (21.7% sparse) in order to cluster end points that have similar lowest effect level (LEL) values, where it is observed that the end points are effectively clustered according to (1) animal species (i.e., the chronic mouse and chronic rat end points were clearly separated) and (2) similar physiological attributes (i.e., liver- and reproductive-related end points were found to separately cluster together). As the liver and reproductive end points exhibited the largest degree of correlation, we further analyzed them using regularized logistic regression in a rank-and-drop framework to identify which subset of in vitro features could be utilized for in vivo toxicity prediction. It was observed that the in vivo end points that had similar LEL responses over the 309 chemicals (as determined by the sparse clustering results) also shared a significant subset of selected in vitro descriptors. Comparing the significant descriptors between the two different categories of end points revealed a specificity of the CYP assays for the liver end points and preferential selection of the estrogen/androgen nuclear receptors by the reproductive end points.
Collapse
Affiliation(s)
- Peter A DiMaggio
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey 08544-5263, USA
| | | | | | | |
Collapse
|
12
|
|
13
|
Shabalin AA, Weigman VJ, Perou CM, Nobel AB. Finding large average submatrices in high dimensional data. Ann Appl Stat 2009. [DOI: 10.1214/09-aoas239] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
14
|
Li A, Horvath S. Network module detection: Affinity search technique with the multi-node topological overlap measure. BMC Res Notes 2009; 2:142. [PMID: 19619323 PMCID: PMC2727520 DOI: 10.1186/1756-0500-2-142] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2008] [Accepted: 07/20/2009] [Indexed: 11/10/2022] Open
Abstract
Background Many clustering procedures only allow the user to input a pairwise dissimilarity or distance measure between objects. We propose a clustering method that can input a multi-point dissimilarity measure d(i1, i2, ..., iP) where the number of points P can be larger than 2. The work is motivated by gene network analysis where clusters correspond to modules of highly interconnected nodes. Here, we define modules as clusters of network nodes with high multi-node topological overlap. The topological overlap measure is a robust measure of interconnectedness which is based on shared network neighbors. In previous work, we have shown that the multi-node topological overlap measure yields biologically meaningful results when used as input of network neighborhood analysis. Findings We adapt network neighborhood analysis for the use of module detection. We propose the Module Affinity Search Technique (MAST), which is a generalized version of the Cluster Affinity Search Technique (CAST). MAST can accommodate a multi-node dissimilarity measure. Clusters grow around user-defined or automatically chosen seeds (e.g. hub nodes). We propose both local and global cluster growth stopping rules. We use several simulations and a gene co-expression network application to argue that the MAST approach leads to biologically meaningful results. We compare MAST with hierarchical clustering and partitioning around medoid clustering. Conclusion Our flexible module detection method is implemented in the MTOM software which can be downloaded from the following webpage:
Collapse
Affiliation(s)
- Ai Li
- Department of Human Genetics, University of California, Los Angeles, CA 90095, USA.
| | | |
Collapse
|
15
|
Chatterjee S, Bhattacharjee K, Konar A. A simple and robust algorithm for microarray data clustering based on gene population-variance ratio metric. Biotechnol J 2009; 4:1357-61. [PMID: 19579218 DOI: 10.1002/biot.200800219] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
With the advent of the microarray technology, the field of life science has been greatly revolutionized, since this technique allows the simultaneous monitoring of the expression levels of thousands of genes in a particular organism. However, the statistical analysis of expression data has its own challenges, primarily because of the huge amount of data that is to be dealt with, and also because of the presence of noise, which is almost an inherent characteristic of microarray data. Clustering is one tool used to mine meaningful patterns from microarray data. In this paper, we present a novel method of clustering yeast microarray data, which is robust and yet simple to implement. It identifies the best clusters from a given dataset on the basis of the population of the clusters as well as the variance of the feature values of the members from the cluster-center. It has been found to yield satisfactory results even in the presence of noisy data.
Collapse
|
16
|
Abstract
In recent years, there has been considerable interest in the analysis of social annotations. Social annotations allow users to annotate web resources more easily, openly and freely than do taxonomies and ontologies. In this paper, we propose a novel algorithm for social annotations. It introduces a fuzzy biclustering algorithm to social annotations for identifying subgroups of users and of resources, and discovering the relationships between those users for social annotations. The algorithm employs a combination of pattern search and compromise programming to construct hierarchically structured biclusters. The pattern search method is used to compute a single objective optimal solution, and the compromise programming is used to trade-off between multiple objectives. The algorithm is not subject to the convexity limitations, and does not need to use the derivative information. It can automatically identify user communities and achieve high prediction accuracies.
Collapse
Affiliation(s)
- Lixin Han
- College of Computer and Information Engineering, Hohai University, Nanjing, Jiangsu, People's Republic of China, , State Key Laboratory of Novel software Technology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
| | - Hong Yan
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, School of Electrical and information Engineering, University of Sydney, NSW 2006, Australia
| |
Collapse
|
17
|
Sen M, Chaudhury SS, Konar A, Janarthanan R. An evolutionary gene expression microarray clustering algorithm based on optimized experimental conditions. 2009 WORLD CONGRESS ON NATURE & BIOLOGICALLY INSPIRED COMPUTING (NABIC) 2009. [DOI: 10.1109/nabic.2009.5393872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/19/2023]
|
18
|
Uitert MV, Meuleman W, Wessels L. Biclustering Sparse Binary Genomic Data. J Comput Biol 2008; 15:1329-45. [DOI: 10.1089/cmb.2008.0066] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Miranda van Uitert
- Bioinformatics and Statistics, Division of Molecular Biology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Wouter Meuleman
- Bioinformatics and Statistics, Division of Molecular Biology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Lodewyk Wessels
- Bioinformatics and Statistics, Division of Molecular Biology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| |
Collapse
|
19
|
DiMaggio PA, McAllister SR, Floudas CA, Feng XJ, Rabinowitz JD, Rabitz HA. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 2008; 9:458. [PMID: 18954459 PMCID: PMC2605474 DOI: 10.1186/1471-2105-9-458] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2008] [Accepted: 10/27/2008] [Indexed: 11/29/2022] Open
Abstract
Background The analysis of large-scale data sets via clustering techniques is utilized in a number of applications. Biclustering in particular has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Biclustering algorithms also have important applications in sample classification where, for instance, tissue samples can be classified as cancerous or normal. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the "best" grouping of elements according to some metric and cluster definition and thus result in suboptimal clusters. Results In this article, we present a rigorous approach to biclustering, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. Cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices to generate biclusters. The performance of OREO is tested on (a) metabolite concentration data, (b) an image reconstruction matrix, (c) synthetic data with implanted biclusters, and gene expression data for (d) colon cancer data, (e) breast cancer data, as well as (f) yeast segregant data to validate the ability of the proposed method and compare it to existing biclustering and clustering methods. Conclusion We demonstrate that this rigorous global optimization method for biclustering produces clusters with more insightful groupings of similar entities, such as genes or metabolites sharing common functions, than other clustering and biclustering algorithms and can reconstruct underlying fundamental patterns in the data for several distinct sets of data matrices arising in important biological applications.
Collapse
Affiliation(s)
- Peter A DiMaggio
- Department of Chemical Engineering, Princeton University, Princeton, NJ, USA.
| | | | | | | | | | | |
Collapse
|
20
|
Abstract
One of the new expanding areas in functional genomics is metabolomics: measuring the metabolome of an organism. Data being generated in metabolomics studies are very diverse in nature depending on the design underlying the experiment. Traditionally, variation in measurements is conceptually broken down in systematic variation and noise where the latter contains, e.g. technical variation. There is increasing evidence that this distinction does not hold (or is too simple) for metabolomics data. A more useful distinction is in terms of informative and non-informative variation where informative relates to the problem being studied. In most common methods for analyzing metabolomics (or any other high-dimensional x-omics) data this distinction is ignored thereby severely hampering the results of the analysis. This leads to poorly interpretable models and may even obscure the relevant biological information. We developed a framework from first data analysis principles by explicitly formulating the problem of analyzing metabolomics data in terms of informative and non-informative parts. This framework allows for flexible interactions with the biologists involved in formulating prior knowledge of underlying structures. The basic idea is that the informative parts of the complex metabolomics data are approximated by simple components with a biological meaning, e.g. in terms of metabolic pathways or their regulation. Hence, we termed the framework ‘simplivariate models’ which constitutes a new way of looking at metabolomics data. The framework is given in its full generality and exemplified with two methods, IDR analysis and plaid modeling, that fit into the framework. Using this strategy of ‘divide and conquer’, we show that meaningful simplivariate models can be obtained using a real-life microbial metabolomics data set. For instance, one of the simple components contained all the measured intermediates of the Krebs cycle of E. coli. Moreover, these simplivariate models were able to uncover regulatory mechanisms present in the phenylalanine biosynthesis route of E. coli.
Collapse
|
21
|
Nowak G, Tibshirani R. Complementary hierarchical clustering. Biostatistics 2008; 9:467-83. [PMID: 18093965 PMCID: PMC3294318 DOI: 10.1093/biostatistics/kxm046] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2007] [Revised: 11/12/2007] [Accepted: 11/12/2007] [Indexed: 01/07/2023] Open
Abstract
When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.
Collapse
Affiliation(s)
- Gen Nowak
- Department of Statistics, Stanford University, Stanford, CA 94305, USA.
| | | |
Collapse
|
22
|
A visual analytics approach for understanding biclustering results from microarray data. BMC Bioinformatics 2008; 9:247. [PMID: 18505552 PMCID: PMC2416653 DOI: 10.1186/1471-2105-9-247] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2007] [Accepted: 05/27/2008] [Indexed: 11/10/2022] Open
Abstract
Background Microarray analysis is an important area of bioinformatics. In the last few years, biclustering has become one of the most popular methods for classifying data from microarrays. Although biclustering can be used in any kind of classification problem, nowadays it is mostly used for microarray data classification. A large number of biclustering algorithms have been developed over the years, however little effort has been devoted to the representation of the results. Results We present an interactive framework that helps to infer differences or similarities between biclustering results, to unravel trends and to highlight robust groupings of genes and conditions. These linked representations of biclusters can complement biological analysis and reduce the time spent by specialists on interpreting the results. Within the framework, besides other standard representations, a visualization technique is presented which is based on a force-directed graph where biclusters are represented as flexible overlapped groups of genes and conditions. This microarray analysis framework (BicOverlapper), is available at Conclusion The main visualization technique, tested with different biclustering results on a real dataset, allows researchers to extract interesting features of the biclustering results, especially the highlighting of overlapping zones that usually represent robust groups of genes and/or conditions. The visual analytics methodology will permit biology experts to study biclustering results without inspecting an overwhelming number of biclusters individually.
Collapse
|
23
|
Yin ZX, Chiang JH. Novel algorithm for coexpression detection in time-varying microarray data sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:120-135. [PMID: 18245881 DOI: 10.1109/tcbb.2007.1052] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
When analyzing the results of microarray experiments, biologists generally use unsupervised categorization tools. However, such tools regard each time point as an independent dimension and utilize the Euclidean distance to compute the similarities between expressions. Furthermore, some of these methods require the number of clusters to be determined in advance, which is clearly impossible in the case of a new dataset. Therefore, this study proposes a novel scheme, designated as the Variation-based Coexpression Detection (VCD) algorithm, to analyze the trends of expressions based on their variation over time. The proposed algorithm has two advantages. First, it is unnecessary to determine the number of clusters in advance since the algorithm automatically detects those genes whose profiles are grouped together and creates patterns for these groups. Second, the algorithm features a new measurement criterion for calculating the degree of change of the expressions between adjacent time points and evaluating their trend similarities. Three real-world microarray datasets are employed to evaluate the performance of the proposed algorithm.
Collapse
|
24
|
Zhao H, Liew AWC, Xie X, Yan H. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. J Theor Biol 2007; 251:264-74. [PMID: 18199458 DOI: 10.1016/j.jtbi.2007.11.030] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2007] [Revised: 10/17/2007] [Accepted: 11/29/2007] [Indexed: 11/30/2022]
Abstract
Biclustering is an important tool in microarray analysis when only a subset of genes co-regulates in a subset of conditions. Different from standard clustering analyses, biclustering performs simultaneous classification in both gene and condition directions in a microarray data matrix. However, the biclustering problem is inherently intractable and computationally complex. In this paper, we present a new biclustering algorithm based on the geometrical viewpoint of coherent gene expression profiles. In this method, we perform pattern identification based on the Hough transform in a column-pair space. The algorithm is especially suitable for the biclustering analysis of large-scale microarray data. Our studies show that the approach can discover significant biclusters with respect to the increased noise level and regulatory complexity. Furthermore, we also test the ability of our method to locate biologically verifiable biclusters within an annotated set of genes.
Collapse
Affiliation(s)
- Hongya Zhao
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
| | | | | | | |
Collapse
|
25
|
Cano C, Adarve L, López J, Blanco A. Possibilistic approach for biclustering microarray data. Comput Biol Med 2007; 37:1426-36. [PMID: 17346690 DOI: 10.1016/j.compbiomed.2007.01.005] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2006] [Revised: 12/20/2006] [Accepted: 01/03/2007] [Indexed: 11/18/2022]
Abstract
Biclustering has emerged as an important method for analyzing gene expression data from microarray technology. It allows to identify groups of genes which behave similarly under a subset of conditions. As a gene may play more than one biological role in conjunction with distinct groups of genes, non-exclusive biclustering algorithms are required. In this paper we propose a new method to obtain potentially-overlapping biclusters, the Possibilistic Spectral Biclustering algorithm (PSB), based on Fuzzy Technology and Spectral Clustering. We tested our method on S. cerevisiae cell cycle expression data and on a human cancer dataset, validating the obtained biclusters using known classifications of conditions and GO Term Finder for functional annotations of genes. Results are available at http://decsai.ugr.es/ approximately ccano/psb.
Collapse
Affiliation(s)
- C Cano
- Departamento de Ciencias de la Computación e I.A, E.T.S. Ingeniería Informática, University of Granada, Granada, Spain.
| | | | | | | |
Collapse
|
26
|
Abstract
With the advent of microarray technology it has been possible to measure thousands of expression values of genes in a single experiment. Biclustering or simultaneous clustering of both genes and conditions is challenging particularly for the analysis of high-dimensional gene expression data in information retrieval, knowledge discovery, and data mining. The objective here is to find sub-matrices, i.e., maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a range of conditions while maximizing the volume simultaneously. Since these two objectives are mutually conflicting, they become suitable candidates for multi-objective modeling. In this study, we will describe some recent literature on biclustering as well as a multi-objective evolutionary biclustering framework for gene expression data along with the experimental results.
Collapse
Affiliation(s)
- Haider Banka
- Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata
| | - Sushmita Mitra
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata
| |
Collapse
|
27
|
Holt KE, Millar AH, Whelan J. ModuleFinder and CoReg: alternative tools for linking gene expression modules with promoter sequences motifs to uncover gene regulation mechanisms in plants. PLANT METHODS 2006; 2:8. [PMID: 16606469 PMCID: PMC1479336 DOI: 10.1186/1746-4811-2-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/05/2005] [Accepted: 04/11/2006] [Indexed: 05/08/2023]
Abstract
BACKGROUND Uncovering the key sequence elements in gene promoters that regulate the expression of plant genomes is a huge task that will require a series of complementary methods for prediction, substantial innovations in experimental validation and a much greater understanding of the role of combinatorial control in the regulation of plant gene expression. RESULTS To add to this larger process and to provide alternatives to existing prediction methods, we have developed several tools in the statistical package R. ModuleFinder identifies sets of genes and treatments that we have found to form valuable sets for analysis of the mechanisms underlying gene co-expression. CoReg then links the hierarchical clustering of these co-expressed sets with frequency tables of promoter elements. These promoter elements can be drawn from known elements or all possible combinations of nucleotides in an element of various lengths. These sets of promoter elements represent putative cis-acting regulatory elements common to sets of co-expressed genes and can be prioritised for experimental testing. We have used these new tools to analyze the response of transcripts for nuclear genes encoding mitochondrial proteins in Arabidopsis to a range of chemical stresses. ModuleFinder provided a subset of co-expressed gene modules that are more logically related to biological functions than did subsets derived from traditional hierarchical clustering techniques. Importantly ModuleFinder linked responses in transcripts for electron transport chain components, carbon metabolism enzymes and solute transporter proteins. CoReg identified several promoter motifs that helped to explain the patterns of expression observed. CONCLUSION ModuleFinder identifies sets of genes and treatments that form useful sets for analysis of the mechanisms behind co-expression. CoReg links the clustering tree of expression-based relationships in these sets with frequency tables of promoter elements. These sets of promoter elements represent putative cis-acting regulatory elements for sets of genes, and can then be tested experimentally. We consider these tools, both built on an open source software product to provide valuable, alternative tools for the prioritisation of promoter elements for experimental analysis.
Collapse
Affiliation(s)
- Kathryn E Holt
- ARC Centre of Excellence in Plant Energy Biology, CMS Building M310 University of Western Australia, 35 Stirling Highway, Crawley 6009, Western Australia, Australia
| | - A Harvey Millar
- ARC Centre of Excellence in Plant Energy Biology, CMS Building M310 University of Western Australia, 35 Stirling Highway, Crawley 6009, Western Australia, Australia
| | - James Whelan
- ARC Centre of Excellence in Plant Energy Biology, CMS Building M310 University of Western Australia, 35 Stirling Highway, Crawley 6009, Western Australia, Australia
| |
Collapse
|