1
|
Gallopin M, Celeux G, Jaffrézic F, Rau A. A model selection criterion for model-based clustering of annotated gene expression data. Stat Appl Genet Mol Biol 2015; 14:413-28. [PMID: 26461845 DOI: 10.1515/sagmb-2014-0095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.
Collapse
|
2
|
Gene network coherence based on prior knowledge using direct and indirect relationships. Comput Biol Chem 2015; 56:142-51. [DOI: 10.1016/j.compbiolchem.2015.03.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2014] [Revised: 03/06/2015] [Accepted: 03/20/2015] [Indexed: 12/21/2022]
|
3
|
Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics 2015; 16:70. [PMID: 25879888 PMCID: PMC4365810 DOI: 10.1186/s12859-015-0490-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2014] [Accepted: 02/04/2015] [Indexed: 01/29/2023] Open
Abstract
Background Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. Results We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Conclusions Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and samples PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0490-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- H Robert Frost
- Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, NH, 03756, USA. .,Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine, Lebanon, NH, 03756, USA. .,Department of Genetics, Dartmouth College, Hanover, NH, 03755, USA.
| | - Zhigang Li
- Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, NH, 03756, USA. .,Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine, Lebanon, NH, 03756, USA.
| | - Jason H Moore
- Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, NH, 03756, USA. .,Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine, Lebanon, NH, 03756, USA. .,Department of Genetics, Dartmouth College, Hanover, NH, 03755, USA.
| |
Collapse
|
4
|
Zeng J, Zhu S, Liew AWC, Yan H. Multiconstrained gene clustering based on generalized projections. BMC Bioinformatics 2010; 11:164. [PMID: 20356386 PMCID: PMC3098054 DOI: 10.1186/1471-2105-11-164] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2009] [Accepted: 03/31/2010] [Indexed: 11/10/2022] Open
Abstract
Background Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem. Results We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints from different nature without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions The POCS-based MGC method can successfully combine multiple constraints from different nature for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions.
Collapse
Affiliation(s)
- Jia Zeng
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
| | | | | | | |
Collapse
|
5
|
Klie S, Nikoloski Z, Selbig J. Biological cluster evaluation for gene function prediction. J Comput Biol 2010; 21:428-45. [PMID: 20059365 DOI: 10.1089/cmb.2009.0129] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Recent advances in high-throughput omics techniques render it possible to decode the function of genes by using the "guilt-by-association" principle on biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottleneck issues: (1) the choice for the number of clusters, and (2) the external measures which do not take in consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address the identified bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is demonstrated valid on synthetic data and available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction which relies on the clustering of optimal score and the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
Collapse
Affiliation(s)
- Sebastian Klie
- 1 Max-Planck Institute for Molecular Plant Physiology , Potsdam, Brandenburg, Germany
| | | | | |
Collapse
|
6
|
Mutwil M, Usadel B, Schütte M, Loraine A, Ebenhöh O, Persson S. Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. PLANT PHYSIOLOGY 2010; 152:29-43. [PMID: 19889879 PMCID: PMC2799344 DOI: 10.1104/pp.109.145318] [Citation(s) in RCA: 122] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
A vital quest in biology is comprehensible visualization and interpretation of correlation relationships on a genome scale. Such relationships may be represented in the form of networks, which usually require disassembly into smaller manageable units, or clusters, to facilitate interpretation. Several graph-clustering algorithms that may be used to visualize biological networks are available. However, only some of these support weighted edges, and none provides good control of cluster sizes, which is crucial for comprehensible visualization of large networks. We constructed an interactive coexpression network for the Arabidopsis (Arabidopsis thaliana) genome using a novel Heuristic Cluster Chiseling Algorithm (HCCA) that supports weighted edges and that may control average cluster sizes. Comparative clustering analyses demonstrated that the HCCA performed as well as, or better than, the commonly used Markov, MCODE, and k-means clustering algorithms. We mapped MapMan ontology terms onto coexpressed node vicinities of the network, which revealed transcriptional organization of previously unrelated cellular processes. We further explored the predictive power of this network through mutant analyses and identified six new genes that are essential to plant growth. We show that the HCCA-partitioned network constitutes an ideal "cartographic" platform for visualization of correlation networks. This approach rapidly provides network partitions with relative uniform cluster sizes on a genome-scale level and may thus be used for correlation network layouts also for other species.
Collapse
|
7
|
Mentzen WI, Floris M, de la Fuente A. Dissecting the dynamics of dysregulation of cellular processes in mouse mammary gland tumor. BMC Genomics 2009; 10:601. [PMID: 20003387 PMCID: PMC2799442 DOI: 10.1186/1471-2164-10-601] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2009] [Accepted: 12/13/2009] [Indexed: 02/01/2023] Open
Abstract
Background Elucidating the sequence of molecular events underlying breast cancer formation is of enormous value for understanding this disease and for design of an effective treatment. Gene expression measurements have enabled the study of transcriptome-wide changes involved in tumorigenesis. This usually occurs through identification of differentially expressed genes or pathways. Results We propose a novel approach that is able to delineate new cancer-related cellular processes and the nature of their involvement in tumorigenesis. First, we define modules as densely interconnected and functionally enriched areas of a Protein Interaction Network. Second, 'differential expression' and 'differential co-expression' analyses are applied to the genes in these network modules, allowing for identification of processes that are up- or down-regulated, as well as processes disrupted (low co-expression) or invoked (high co-expression) in different tumor stages. Finally, we propose a strategy to identify regulatory miRNAs potentially responsible for the observed changes in module activities. We demonstrate the potential of this analysis on expression data from a mouse model of mammary gland tumor, monitored over three stages of tumorigenesis. Network modules enriched in adhesion and metabolic processes were found to be inactivated in tumor cells through the combination of dysregulation and down-regulation, whereas the activation of the integrin complex and immune system response modules is achieved through increased co-regulation and up-regulation. Additionally, we confirmed a known miRNA involved in mammary gland tumorigenesis, and present several new candidates for this function. Conclusions Understanding complex diseases requires studying them by integrative approaches that combine data sources and different analysis methods. The integration of methods and data sources proposed here yields a sensitive tool, able to pinpoint new processes with a role in cancer, dissect modulation of their activity and detect the varying assignments of genes to functional modules over the course of a disease.
Collapse
Affiliation(s)
- Wieslawa I Mentzen
- CRS4 Bioinformatica, Parco Scientifico e Technologico POLARIS, 09010 Pula (CA), Italy
| | | | | |
Collapse
|
8
|
Mentzen WI, Wurtele ES. Regulon organization of Arabidopsis. BMC PLANT BIOLOGY 2008; 8:99. [PMID: 18826618 PMCID: PMC2567982 DOI: 10.1186/1471-2229-8-99] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/06/2008] [Accepted: 09/30/2008] [Indexed: 05/20/2023]
Abstract
BACKGROUND Despite the mounting research on Arabidopsis transcriptome and the powerful tools to explore biology of this model plant, the organization of expression of Arabidopsis genome is only partially understood. Here, we create a coexpression network from a 22,746 Affymetrix probes dataset derived from 963 microarray chips that query the transcriptome in response to a wide variety of environmentally, genetically, and developmentally induced perturbations. RESULTS Markov chain graph clustering of the coexpression network delineates 998 regulons ranging from one to 1623 genes in size. To assess the significance of the clustering results, the statistical over-representation of GO terms is averaged over this set of regulons and compared to the analogous values for 100 randomly-generated sets of clusters. The set of regulons derived from the experimental data scores significantly better than any of the randomly-generated sets. Most regulons correspond to identifiable biological processes and include a combination of genes encoding related developmental, metabolic pathway, and regulatory functions. In addition, nearly 3000 genes of unknown molecular function or process are assigned to a regulon. Only five regulons contain plastomic genes; four of these are exclusively plastomic. In contrast, expression of the mitochondrial genome is highly integrated with that of nuclear genes; each of the seven regulons containing mitochondrial genes also incorporates nuclear genes. The network of regulons reveals a higher-level organization, with dense local neighborhoods articulated for photosynthetic function, genetic information processing, and stress response. CONCLUSION This analysis creates a framework for generation of experimentally testable hypotheses, gives insight into the concerted functions of Arabidopsis at the transcript level, and provides a test bed for comparative systems analysis.
Collapse
Affiliation(s)
- Wieslawa I Mentzen
- CRS4 Bioinformatics Laboratory, Parco Scientifico e Technologico POLARIS, 09010 Pula (CA), Italy
| | - Eve Syrkin Wurtele
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
| |
Collapse
|
9
|
Redestig H, Weicht D, Selbig J, Hannah MA. Transcription factor target prediction using multiple short expression time series from Arabidopsis thaliana. BMC Bioinformatics 2007; 8:454. [PMID: 18021423 PMCID: PMC2198923 DOI: 10.1186/1471-2105-8-454] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2007] [Accepted: 11/18/2007] [Indexed: 11/12/2022] Open
Abstract
Background The central role of transcription factors (TFs) in higher eukaryotes has led to much interest in deciphering transcriptional regulatory interactions. Even in the best case, experimental identification of TF target genes is error prone, and has been shown to be improved by considering additional forms of evidence such as expression data. Previous expression based methods have not explicitly tried to associate TFs with their targets and therefore largely ignored the treatment specific and time dependent nature of transcription regulation. Results In this study we introduce CERMT, Covariance based Extraction of Regulatory targets using Multiple Time series. Using simulated and real data we show that using multiple expression time series, selecting treatments in which the TF responds, allowing time shifts between TFs and their targets and using covariance to identify highly responding genes appear to be a good strategy. We applied our method to published TF – target gene relationships determined using expression profiling on TF mutants and show that in most cases we obtain significant target gene enrichment and in half of the cases this is sufficient to deliver a usable list of high-confidence target genes. Conclusion CERMT could be immediately useful in refining possible target genes of candidate TFs using publicly available data, particularly for organisms lacking comprehensive TF binding data. In the future, we believe its incorporation with other forms of evidence may improve integrative genome-wide predictions of transcriptional networks.
Collapse
Affiliation(s)
- Henning Redestig
- Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, D-14476 Potsdam-Golm, Germany.
| | | | | | | |
Collapse
|
10
|
Chen JL, Liu Y, Sam LT, Li J, Lussier YA. Evaluation of high-throughput functional categorization of human disease genes. BMC Bioinformatics 2007; 8 Suppl 3:S7. [PMID: 17493290 PMCID: PMC1892104 DOI: 10.1186/1471-2105-8-s3-s7] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Background Biological data that are well-organized by an ontology, such as Gene Ontology, enables high-throughput availability of the semantic web. It can also be used to facilitate high throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human diseases genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compared them to a gold standard. This gold standard was independently conceived by Valle's research group, and contains 923 human disease genes organized in 14 categories of protein function. Results Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods can achieve an overall 56% and 47% precision with 62% and 71% recall respectively. However, approximately 15% of the studied human disease genes remain without GO annotations. Conclusion Automated methods can recapitulate a significant portion of classification of the human disease genes. The method using information-theoretic distance performs slightly better on the precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%. In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations.
Collapse
Affiliation(s)
- James L Chen
- Department of Medicine, Georgetown University Hospital, Washington, DC, USA
| | - Yang Liu
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA
| | - Lee T Sam
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA
| | - Jianrong Li
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA
| | - Yves A Lussier
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| |
Collapse
|