301
Boscolo R, Liao JC, Roychowdhury VP. An information theoretic exploratory method for learning patterns of conditional gene coexpression from microarray data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2008; 5:15-24. [PMID: 18245872 DOI: 10.1109/tcbb.2007.1056] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 05/25/2023]
Abstract
In this article, we introduce an exploratory framework for learning patterns of conditional co-expression in gene expression data. The main idea behind the proposed approach consists of estimating how the information content shared by a set of M nodes in a network (where each node is associated with an expression profile) varies upon conditioning on a set of L conditioning variables (in the simplest case represented by a separate set of expression profiles). The method is non-parametric and is based on the concept of statistical co-information, which, unlike conventional correlation-based techniques, is not restricted in scope to linear conditional dependency patterns. Moreover, such conditional co-expression relationships can potentially indicate regulatory interactions that do not manifest themselves when only pair-wise relationships are considered. A moment-based approximation of the co-information measure is derived that efficiently gets around the problem of estimating high-dimensional multivariate probability density functions from the data, a task usually not viable due to the intrinsic sample size limitations that characterize expression level measurements. By applying the proposed exploratory method, we analyzed a whole-genome microarray assay of the eukaryote Saccharomyces cerevisiae and were able to learn statistically significant patterns of conditional co-expression. A selection of such interactions that carry a meaningful biological interpretation is discussed.
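For orientation, the three-variable case of co-information can be written out as below. This is a sketch under one common sign convention (conventions differ between authors), not the paper's own definition or its moment-based approximation, neither of which is given in the abstract. The symbols X, Y (two expression profiles) and Z (one conditioning variable) are illustrative.

```latex
% Co-information of X, Y, Z: information shared by X and Y,
% minus what they still share once Z is known.
\begin{align}
  I(X;Y;Z)     &= I(X;Y) - I(X;Y \mid Z), \\
  I(X;Y)       &= H(X) + H(Y) - H(X,Y), \\
  I(X;Y \mid Z) &= H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
\end{align}
```

Under this convention a negative value means that conditioning on Z increases the information shared by X and Y, which is the kind of relationship that a purely pair-wise analysis cannot reveal.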
Affiliation(s)
- Riccardo Boscolo
- Department of Electrical Engineering, University of California, Los Angeles 90095, USA.
302
An overview of statistical decomposition techniques applied to complex systems. Comput Stat Data Anal 2008; 52:2292-2310. [PMID: 19724659 DOI: 10.1016/j.csda.2007.09.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/22/2022]
Abstract
The current state of the art in applied decomposition techniques is summarized within a comparative uniform framework. These techniques are classified by the parametric or information theoretic approaches they adopt. An underlying structural model common to all parametric approaches is outlined. The nature and premises of a typical information theoretic approach are stressed. Some possible application patterns for an information theoretic approach are illustrated. Composition is distinguished from decomposition by pointing out that the former is not a simple reversal of the latter. From the standpoint of application to complex systems, a general evaluation is provided.
303
Jarboe LR, Hyduke DR, Tran LM, Chou KJY, Liao JC. Determination of the Escherichia coli S-nitrosoglutathione response network using integrated biochemical and systems analysis. J Biol Chem 2007; 283:5148-57. [PMID: 18070885 DOI: 10.1074/jbc.m706018200] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Indexed: 12/25/2022]
Abstract
During infection or denitrification, bacteria encounter reactive nitrogen species. Although the molecular targets of and defensive response against nitric oxide (NO) in Escherichia coli are well studied, the response elements specific to S-nitrosothiols are less clear. Previously, we employed an integrated systems biology approach to unravel the E. coli NO-response network. Here we use a similar approach to confirm that S-nitrosoglutathione (GSNO) primarily impacts the metabolic and regulatory programs of E. coli in minimal medium by reaction with homocysteine and cysteine and subsequent disruption of the methionine biosynthesis pathway. Targeting of homocysteine and cysteine results in altered regulatory activity of MetJ, MetR, and CysB, activation of the stringent response and growth inhibition. Deletion of metJ or supplementation with methionine strongly attenuated the effect of GSNO on growth and gene expression. Furthermore, GSNO inhibited the ArcAB two-component system. Consistent with the underlying nitrosative and thiol-oxidative chemistry, growth inhibition and the majority of the regulatory perturbations were dependent upon GSNO internalization by the Dpp dipeptide transporter. Contrastingly, perturbation of NsrR appeared to be a result of the submicromolar levels of NO released from GSNO and did not require GSNO internalization.
Affiliation(s)
- Laura R Jarboe
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California 90095, USA
304
Jensen ST, Chen G, Stoeckert, Jr. CJ. Bayesian variable selection and data integration for biological regulatory networks. Ann Appl Stat 2007. [DOI: 10.1214/07-aoas130] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
305
Cheng C, Yan X, Sun F, Li LM. Inferring activity changes of transcription factors by binding association with sorted expression profiles. BMC Bioinformatics 2007; 8:452. [PMID: 18021409 PMCID: PMC2194743 DOI: 10.1186/1471-2105-8-452] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.9] [Received: 05/25/2007] [Accepted: 11/16/2007] [Indexed: 01/10/2023]
Abstract
BACKGROUND The identification of transcription factors (TFs) associated with a biological process is fundamental to understanding its regulatory mechanisms. From microarray data, however, the activity changes of TFs often cannot be directly observed due to their relatively low expression levels, post-transcriptional modifications, and other complications. Several approaches have been proposed to infer TF activity changes from microarray data. In some models, a linear relationship between gene expression and TF-gene binding strength is assumed. In other models, the target genes of a TF are first determined by applying a significance cutoff to binding affinity scores, and then expression differentiation is checked between the target genes and all other genes. RESULTS We propose a novel method, referred to as BASE (binding association with sorted expression), to infer TF activity changes from microarray expression profiles with the help of binding affinity data. It searches for the maximum association between the binding affinity profile of a TF and the expression change profile along the direction of sorted differentiation. The method does not make a hard target gene selection; rather, the significance of TF activity changes is evaluated at the end by permutation tests of binding association. To show the effectiveness of this method, we apply it to three typical examples using different kinds of binding affinity data, namely ChIP-chip data, motif discovery data, and positional weighted matrix scanning data. The implications obtained from all three examples are consistent with established biological results. Moreover, the inferences suggest new and biologically meaningful hypotheses for further investigation. CONCLUSION The proposed method makes transcription inference from profiles of expression and binding affinity. The same machinery can be used to deal with various kinds of binding affinity data. The method does not require a linear assumption, and has the desirable property of scale-invariance with respect to TF-specific binding affinity. This method is easy to implement and can be routinely applied for transcriptional inferences in microarray studies.
Affiliation(s)
- Chao Cheng
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-2910, USA.
306
Kluger Y, Kluger H, Tuck D. Association between pathways in regulatory networks. Conference Proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2007; 2006:2036-40. [PMID: 17946929 DOI: 10.1109/iembs.2006.260730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
During cell progression from one state to another, such as transformation from benign to malignant conditions, cells undergo changes in gene regulation. To reveal state-dependent circuitries in human regulatory networks, we employed drafts of normal and malignant cell networks. Using these condition specific networks, gene profiles and annotated pathways we studied: a) the capacity to separate samples or cell states based on the collective expression of all the genes in each pathway rather than individual genes, b) the degree of regulatory network connectivity within and between pathways. Distinct cell types reveal notable differences in transcriptional activity in numerous pathways. On the other hand, in datasets from breast cancer patients with variable outcome the capacity of single pathway expression signatures to predict disease outcome is very limited, though this can be somewhat improved by combining multiple pathways. Remarkable connectivity between pathways on the transcriptional regulatory level revealed a non-modular network structure. Overall, network blueprints enable us to quantify the degree of interaction between condition specific co-regulated pathways. This can contribute to understanding deregulated processes associated with cancer.
Affiliation(s)
- Yuval Kluger
- Dept. of Cell Biol., New York Univ. Sch. of Medicine, NY, NY 10016, USA.
307
Androulakis IP, Yang E, Almon RR. Analysis of time-series gene expression data: methods, challenges, and opportunities. Annu Rev Biomed Eng 2007; 9:205-28. [PMID: 17341157 PMCID: PMC4181347 DOI: 10.1146/annurev.bioeng.9.060906.151904] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Indexed: 11/09/2022]
Abstract
Monitoring the change in expression patterns over time provides the distinct possibility of unraveling the mechanistic drivers characterizing cellular responses. Gene arrays measuring the level of mRNA expression of thousands of genes simultaneously provide a method of high-throughput data collection necessary for obtaining the scope of data required for understanding the complexities of living organisms. Unraveling the coherent complex structures of transcriptional dynamics is the goal of a large family of computational methods aiming at upgrading the information content of time-course gene expression data. In this review, we summarize the qualitative characteristics of these approaches, discuss the main challenges that this type of complex data presents, and, finally, explore the opportunities in the context of developing mechanistic models of cellular response.
Affiliation(s)
- I P Androulakis
- Biomedical Engineering Department, Rutgers University, Piscataway, New Jersey 08854, USA.
308
Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res 2007; 36:D866-70. [PMID: 17932051 PMCID: PMC2238822 DOI: 10.1093/nar/gkm815] [Citation(s) in RCA: 197] [Impact Index Per Article: 11.6] [Indexed: 11/13/2022]
Abstract
Many Microbe Microarrays Database (M3D) is designed to facilitate the analysis and visualization of expression data in compendia compiled from multiple laboratories. M3D contains over a thousand Affymetrix microarrays for Escherichia coli, Saccharomyces cerevisiae and Shewanella oneidensis. The expression data is uniformly normalized to make the data generated by different laboratories and researchers more comparable. To facilitate computational analyses, M3D provides raw data (CEL file) and normalized data downloads of each compendium. In addition, web-based construction, visualization and download of custom datasets are provided to facilitate efficient interrogation of the compendium for more focused analyses. The experimental condition metadata in M3D is human curated with each chemical and growth attribute stored as a structured and computable set of experimental features with consistent naming conventions and units. All versions of the normalized compendia constructed for each species are maintained and accessible in perpetuity to facilitate the future interpretation and comparison of results published on M3D data. M3D is accessible at http://m3d.bu.edu/.
Affiliation(s)
- Jeremiah J Faith
- Program in Bioinformatics, Boston University, 24 Cummington St. and Department of Biomedical Engineering, Boston University, 44 Cummington St., Boston, Massachusetts, 02215, USA
309
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 2007; 5:e8. [PMID: 17214507 PMCID: PMC1764438 DOI: 10.1371/journal.pbio.0050008] [Citation(s) in RCA: 980] [Impact Index Per Article: 57.6] [Received: 05/03/2006] [Accepted: 11/07/2006] [Indexed: 11/19/2022]
Abstract
Machine learning approaches offer the potential to systematically identify transcriptional regulatory interactions from a compendium of microarray expression profiles. However, experimental validation of the performance of these methods at the genome scale has remained elusive. Here we assess the global performance of four existing classes of inference algorithms using 445 Escherichia coli Affymetrix arrays and 3,216 known E. coli regulatory interactions from RegulonDB. We also developed and applied the context likelihood of relatedness (CLR) algorithm, a novel extension of the relevance networks class of algorithms. CLR demonstrates an average precision gain of 36% relative to the next-best performing algorithm. At a 60% true positive rate, CLR identifies 1,079 regulatory interactions, of which 338 were in the previously known network and 741 were novel predictions. We tested the predicted interactions for three transcription factors with chromatin immunoprecipitation, confirming 21 novel interactions and verifying our RegulonDB-based performance estimates. CLR also identified a regulatory link providing central metabolic control of iron transport, which we confirmed with real-time quantitative PCR. The compendium of expression data compiled in this study, coupled with RegulonDB, provides a valuable model system for further improvement of network inference algorithms using experimental data.
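The abstract names the context likelihood of relatedness (CLR) algorithm without spelling out its steps. The sketch below follows the form in which CLR is commonly described: each pairwise mutual information value is z-scored against the background distribution of values for each of the two genes involved, and the two z-scores are combined. It is an illustration under stated assumptions, not the authors' implementation: the function name `clr_scores` is hypothetical, the mutual information matrix is assumed to be precomputed and symmetric, and the diagonal is left in each gene's background for simplicity.

```python
import numpy as np

def clr_scores(mi):
    """Combine per-gene background z-scores of a symmetric MI matrix (CLR-style sketch)."""
    mi = np.asarray(mi, dtype=float)
    # Background statistics of each gene's MI values (one row per gene).
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for genes with a flat background
    # z[i, j]: how far MI(i, j) lies above gene i's background; negatives are clipped to 0.
    z = np.maximum(0.0, (mi - mean) / std)
    # Joint score for the pair (i, j) uses both genes' backgrounds.
    return np.sqrt(z ** 2 + z.T ** 2)

# Toy example: genes 0 and 1 share information, gene 2 is unrelated to both.
toy_mi = [[0.0, 1.0, 0.0],
          [1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
scores = clr_scores(toy_mi)
```

Scoring each pair against both genes' backgrounds is what distinguishes this from a plain relevance network, which would threshold the raw mutual information values directly.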
Affiliation(s)
- Jeremiah J Faith
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Boris Hayete
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Joshua T Thaden
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Boston University School of Medicine, Boston, Massachusetts, United States of America
- Ilaria Mogno
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Department of Computer and Systems Science A. Ruberti, University of Rome, La Sapienza, Rome, Italy
- Jamey Wierzbowski
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Cellicon Biotechnologies, Boston, Massachusetts, United States of America
- Guillaume Cottarel
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Cellicon Biotechnologies, Boston, Massachusetts, United States of America
- Simon Kasif
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- James J Collins
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Timothy S Gardner
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- * To whom correspondence should be addressed. E-mail:
310
Bussemaker HJ, Ward LD, Boorsma A. Dissecting complex transcriptional responses using pathway-level scores based on prior information. BMC Bioinformatics 2007; 8 Suppl 6:S6. [PMID: 17903287 PMCID: PMC1995543 DOI: 10.1186/1471-2105-8-s6-s6] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022]
Abstract
Background The genomewide pattern of changes in mRNA expression measured using DNA microarrays is typically a complex superposition of the response of multiple regulatory pathways to changes in the environment of the cells. The use of prior information, either about the function of the protein encoded by each gene, or about the physical interactions between regulatory factors and the sequences controlling its expression, has emerged as a powerful approach for dissecting complex transcriptional responses. Results We review two different approaches for combining the noisy expression levels of multiple individual genes into robust pathway-level differential expression scores. The first is based on a comparison between the distribution of expression levels of genes within a predefined gene set and those of all other genes in the genome. The second starts from an estimate of the strength of genomewide regulatory network connectivities based on sequence information or direct measurements of protein-DNA interactions, and uses regression analysis to estimate the activity of gene regulatory pathways. The statistical methods used are explained in detail. Conclusion By avoiding the thresholding of individual genes, pathway-level analysis of differential expression based on prior information can be considerably more sensitive to subtle changes in gene expression than gene-level analysis. The methods are technically straightforward and yield results that are easily interpretable, both biologically and statistically.
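The first approach in the abstract, comparing expression levels inside a predefined gene set with those of all other genes, can be illustrated with a two-sample statistic. The sketch below uses Welch's t-statistic, which is an assumption made here for illustration; the abstract does not say which test the review uses. The function name `gene_set_t_score` and the gene identifiers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def gene_set_t_score(expression, gene_set):
    """Welch t-statistic of expression values inside a gene set versus all other genes."""
    inside = [value for gene, value in expression.items() if gene in gene_set]
    outside = [value for gene, value in expression.items() if gene not in gene_set]
    # Standard error of the difference in means, without assuming equal variances.
    standard_error = sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / standard_error

# Hypothetical log-ratios for six genes; the set {g1, g2, g3} is shifted upward.
log_ratios = {"g1": 2.0, "g2": 3.0, "g3": 4.0, "g4": 0.0, "g5": 1.0, "g6": 2.0}
score = gene_set_t_score(log_ratios, {"g1", "g2", "g3"})
```

No individual gene is thresholded here: every gene contributes its measured value to one of the two groups, which is the property the abstract credits for the added sensitivity of pathway-level scores.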
Affiliation(s)
- Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, 1212 Amsterdam Avenue, MC 2441, New York, NY 10027, USA
- Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA
- Lucas D Ward
- Department of Biological Sciences, Columbia University, 1212 Amsterdam Avenue, MC 2441, New York, NY 10027, USA
- Andre Boorsma
- Swammerdam Institute for Life Sciences, University of Amsterdam, BioCentrum Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands
311
Wang RS, Wang Y, Zhang XS, Chen L. Inferring transcriptional regulatory networks from high-throughput data. Bioinformatics 2007; 23:3056-64. [PMID: 17890736 DOI: 10.1093/bioinformatics/btm465] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION Inferring the relationships between transcription factors (TFs) and their targets has utmost importance for understanding the complex regulatory mechanisms in cellular systems. However, the transcription factor activities (TFAs) cannot be measured directly by standard microarray experiment owing to various post-translational modifications. In particular, cooperative mechanism and combinatorial control are common in gene regulation, e.g. TFs usually recruit other proteins cooperatively to facilitate transcriptional reaction processes. RESULTS In this article, we propose a novel method for inferring transcriptional regulatory networks (TRN) from gene expression data based on protein transcription complexes and mass action law. With gene expression data and TFAs estimated from transcription complex information, the inference of TRN is formulated as a linear programming (LP) problem which has a globally optimal solution in terms of L(1) norm error. The proposed method not only can easily incorporate ChIP-Chip data as prior knowledge, but also can integrate multiple gene expression datasets from different experiments simultaneously. A unique feature of our method is to take into account protein cooperation in transcription process. We tested our method by using both synthetic data and several experimental datasets in yeast. The extensive results illustrate the effectiveness of the proposed method for predicting transcription regulatory relationships between TFs with co-regulators and target genes.
Affiliation(s)
- Rui-Sheng Wang
- School of Information, Renmin University of China, Beijing 100872, China
312
Yuan S, Li KC. Context-dependent clustering for dynamic cellular state modeling of microarray gene expression. Bioinformatics 2007; 23:3039-47. [PMID: 17846037 DOI: 10.1093/bioinformatics/btm457] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION High-throughput expression profiling allows researchers to study gene activities globally. Genes with similar expression profiles are likely to encode proteins that may participate in a common structural complex, metabolic pathway or biological process. Many clustering, classification and dimension reduction approaches, powerful in elucidating the expression data, are based on this rationale. However, the converse of this common perception can be misleading. In fact, many biologically related genes turn out to be uncorrelated in expression. RESULTS In this article, we present a novel method for investigating gene co-expression patterns. We assume the correlation between functionally related genes can be strengthened or weakened according to changes in some relevant, yet unknown, cellular states. We develop a context-dependent clustering (CDC) method to model the cellular state variable. We apply it to the transcription regulatory study for Saccharomyces cerevisiae, using the Stanford cell-cycle gene expression data. We investigate the co-expression patterns between transcription factors (TFs) and their target genes (TGs) predicted by the genome-wide location analysis of Harbison et al. Since a TF regulates the expression of its TGs, correlation between TF and TG expression profiles can be expected. But as many authors have observed, the expression of transcription factors does not correlate well with the expression of their target genes. Instead of attributing the main reason to the lack of correlation between the transcript abundance and TF activity, we search for cellular conditions that would facilitate the TF-TG correlation. The results for sulfur amino acid pathway regulation by MET4, respiratory gene regulation by HAP4, and mitotic cell cycle regulation by ACE2/SWI5 are discussed in detail. Our method suggests a new way to understand the complex biological system from microarray data.
Affiliation(s)
- Shinsheng Yuan
- Institute of Statistical Science, Academia Sinica, 128, Section 2, Academia Road, Nankang, Taipei 115, Taiwan, ROC
313
Larsen P, Almasri E, Chen G, Dai Y. A statistical method to incorporate biological knowledge for generating testable novel gene regulatory interactions from microarray experiments. BMC Bioinformatics 2007; 8:317. [PMID: 17727721 PMCID: PMC2082045 DOI: 10.1186/1471-2105-8-317] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 12/11/2006] [Accepted: 08/29/2007] [Indexed: 11/16/2022]
Abstract
Background The incorporation of prior biological knowledge in the analysis of microarray data has become important in the reconstruction of transcription regulatory networks in a cell. Most of the current research has focused on the integration of multiple sets of microarray data as well as curated databases for a genome-scale reconstruction. However, individual researchers are more interested in extracting the most useful information from the data of their hypothesis-driven microarray experiments. How to compile the prior biological knowledge from the literature to facilitate new hypothesis generation from a microarray experiment is the focus of this work. We propose a novel method based on the statistical analysis of reported gene interactions in PubMed literature. Results Using Gene Ontology (GO) Molecular Function annotation for reported gene regulatory interactions in PubMed literature, a statistical analysis method was proposed for the derivation of a likelihood of interaction (LOI) score for a pair of genes. The LOI-score and the Pearson correlation coefficient of gene profiles were utilized to check if a pair of query genes would be in the above specified interaction. The method was validated in the analysis of two gene sets formed from the yeast Saccharomyces cerevisiae cell cycle microarray data. It was found that a high percentage of identified interactions share GO Biological Process annotations (39.5% for a 102 interaction enriched gene set and 23.0% for a larger 999 cyclically expressed gene set). Conclusion This method can uncover novel biologically relevant gene interactions. With stringent confidence levels, small interaction networks can be identified for further establishment of a hypothesis testable by biological experiment. This procedure is computationally inexpensive and can be used as a preprocessing procedure for screening potential biologically relevant gene pairs subject to the analysis with sophisticated statistical methods.
Affiliation(s)
- Peter Larsen
- Core Genomics Laboratory at University of Illinois at Chicago, 845 West Taylor Street Chicago, IL 60607, USA
- Eyad Almasri
- Department of Bioengineering (MC063), University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA
- Guanrao Chen
- Department of Computer Science, University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA
- Yang Dai
- Department of Bioengineering (MC063), University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA
314
Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J. Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics 2007; 8:299. [PMID: 17697349 PMCID: PMC2212665 DOI: 10.1186/1471-2105-8-299] [Citation(s) in RCA: 160] [Impact Index Per Article: 9.4] [Received: 09/20/2006] [Accepted: 08/14/2007] [Indexed: 11/16/2022]
Abstract
Background Large-scale sequencing of entire genomes has ushered in a new age in biology. One of the next grand challenges is to dissect the cellular networks consisting of many individual functional modules. Defining co-expression networks without ambiguity based on genome-wide microarray data is difficult, and current methods are not robust or consistent across different data sets. This is particularly problematic for little-understood organisms, since not much existing biological knowledge can be exploited for determining the threshold to differentiate true correlation from random noise. Random matrix theory (RMT), which has been widely and successfully used in physics, is a powerful approach to distinguish system-specific, non-random properties embedded in complex systems from random noise. Here, we have hypothesized that the universal predictions of RMT are also applicable to biological systems and that the correlation threshold can be determined by characterizing the correlation matrix of microarray profiles using random matrix theory. Results Application of random matrix theory to microarray data of S. oneidensis, E. coli, yeast, A. thaliana, Drosophila, mouse and human indicates that there is a sharp transition of the nearest neighbour spacing distribution (NNSD) of the correlation matrix after gradually removing certain elements inside the matrix. Testing on an in silico modular model has demonstrated that this transition can be used to determine the correlation threshold for revealing modular co-expression networks. The co-expression network derived from yeast cell cycling microarray data is supported by gene annotation. The topological properties of the resulting co-expression network agree well with the general properties of biological networks. Computational evaluations have shown that the RMT approach is sensitive and robust. Furthermore, evaluation on sampled expression data of an in silico modular gene system has shown that under-sampled expressions do not affect the recovery of the gene co-expression network. Moreover, the cellular roles of 215 functionally unknown genes from yeast, E. coli and S. oneidensis are predicted by the gene co-expression networks using the guilt-by-association principle, many of which are supported by existing information or our experimental verification, further demonstrating the reliability of this approach for gene function prediction. Conclusion Our rigorous analysis of gene expression microarray profiles using RMT has shown that the transition of the NNSD of the correlation matrix of microarray profiles provides a profound theoretical criterion to determine the correlation threshold for identifying gene co-expression networks.
Affiliation(s)
- Feng Luo
- Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- School of Computing, Clemson University, Clemson, SC, 29634, USA
| | - Yunfeng Yang
- Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
| | - Jianxin Zhong
- Computer Science & Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- Department of Physics, Xiangtan University, Hunan 411105, PR China
| | - Haichun Gao
- Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- Institute for Environmental Genomics, and Department of Botany and Microbiology, University of Oklahoma, Norman, OK, 73019, USA
| | - Latifur Khan
- Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA
| | - Dorothea K Thompson
- Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
| | - Jizhong Zhou
- Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- Institute for Environmental Genomics, and Department of Botany and Microbiology, University of Oklahoma, Norman, OK, 73019, USA
|
315
|
Bussemaker HJ, Foat BC, Ward LD. Predictive modeling of genome-wide mRNA expression: from modules to molecules. Annu Rev Biophys Biomol Struct 2007; 36:329-47. [PMID: 17311525 DOI: 10.1146/annurev.biophys.36.040306.132725] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.6] [Indexed: 11/09/2022]
Abstract
Various algorithms are available for predicting mRNA expression and modeling gene regulatory processes. They differ in whether they rely on the existence of modules of coregulated genes or build a model that applies to all genes, whether they represent regulatory activities as hidden variables or as mRNA levels, and whether they implicitly or explicitly model the complex cis-regulatory logic of multiple interacting transcription factors binding the same DNA. The fact that functional genomics data of different types reflect the same molecular processes provides a natural strategy for integrative computational analysis. One promising avenue toward an accurate and comprehensive model of gene regulation combines biophysical modeling of the interactions among proteins, DNA, and RNA with the use of large-scale functional genomics data to estimate regulatory network connectivity and activity parameters. As the ability of these models to represent complex cis-regulatory logic increases, the need for approaches based on cross-species conservation may diminish.
Affiliation(s)
- Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA.
|
316
|
Teschendorff AE, Journée M, Absil PA, Sepulchre R, Caldas C. Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput Biol 2007; 3:e161. [PMID: 17708679 PMCID: PMC1950343 DOI: 10.1371/journal.pcbi.0030161] [Citation(s) in RCA: 104] [Impact Index Per Article: 6.1] [Received: 02/13/2007] [Accepted: 06/28/2007] [Indexed: 12/29/2022]
Abstract
The quantity of mRNA transcripts in a cell is determined by a complex interplay of cooperative and counteracting biological processes. Independent Component Analysis (ICA) is one of a small number of unsupervised algorithms that have been applied to microarray gene expression data in an attempt to understand phenotype differences in terms of changes in the activation/inhibition patterns of biological pathways. While the ICA model has been shown to outperform other linear representations of the data such as Principal Components Analysis (PCA), a validation using explicit pathway and regulatory element information has not yet been performed. We apply a range of popular ICA algorithms to six of the largest microarray cancer datasets and use pathway-knowledge and regulatory-element databases for validation. We show that ICA outperforms PCA and clustering-based methods in that ICA components map closer to known cancer-related pathways, regulatory modules, and cancer phenotypes. Furthermore, we identify cancer signalling and oncogenic pathways and regulatory modules that play a prominent role in breast cancer and relate the differential activation patterns of these to breast cancer phenotypes. Importantly, we find novel associations linking immune response and epithelial–mesenchymal transition pathways with estrogen receptor status and histological grade, respectively. In addition, we find associations linking the activity levels of biological pathways and transcription factors (NF1 and NFAT) with clinical outcome in breast cancer. ICA provides a framework for a more biologically relevant interpretation of genomewide transcriptomic data. Adopting ICA as the analysis tool of choice will help understand the phenotype–pathway relationship and thus help elucidate the molecular taxonomy of heterogeneous cancers and of other complex genetic diseases.
The amount of a given transcript or protein in a cell is determined by a balance of expression and repression in a complex network of biological processes. This delicate balance is compromised in complex genetic diseases such as cancer by alterations in the activation patterns of functionally important biological processes known as pathways. Over the last years, a large number of microarray experiments profiling the expression levels of more than 20,000 human genes in hundreds of tumor samples have shown that most cancer types are heterogeneous diseases, each characterized by many different expression subtypes. The biological and clinical goal is to explain the observed tumor and clinical heterogeneity in terms of specific patterns of altered pathways. The bioinformatic challenge is therefore to devise mathematical tools that explicitly attempt to infer these altered pathways. To this end, we applied a signal processing tool in a meta-analysis of breast cancer, encompassing more than 800 tumor specimens derived from four different patient cohorts, and showed that this algorithm significantly outperforms popular standard bioinformatics tools in identifying altered pathways underlying breast cancer. These results show that the same tool could be applied to other complex human genetic diseases to better elucidate the underlying altered pathways.
Affiliation(s)
- Andrew E Teschendorff
- Breast Cancer Functional Genomics Laboratory, Cancer Research UK Cambridge Research Institute, Cambridge, United Kingdom.
|
317
|
Sun W, Yu T, Li KC. Detection of eQTL modules mediated by activity levels of transcription factors. Bioinformatics 2007; 23:2290-7. [PMID: 17599927 DOI: 10.1093/bioinformatics/btm327] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022]
Abstract
MOTIVATION Studies of gene expression quantitative trait loci (eQTL) in different organisms have shown the existence of eQTL hot spots, each being a small segment of DNA sequence that harbors the eQTL of a large number of genes. Two questions of great interest about eQTL hot spots arise: (1) Which gene within the hot spot is responsible for the linkages, i.e. which gene is the quantitative trait gene (QTG)? (2) How does a QTG affect the expression levels of many genes linked to it? Answers to the first question can be offered by available biological evidence or by statistical methods. The second question is harder to address. One simple situation is that the QTG encodes a transcription factor (TF), which regulates the expression of genes linked to it. However, previous results have shown that TFs are not overrepresented in the eQTL hot spots. In this article, we consider the scenario that the propagation of genetic perturbation from a QTG to other linked genes is mediated by the TF activity. We develop a procedure to detect the eQTL modules (eQTL hot spots together with linked genes) that are compatible with this scenario. RESULTS We first detect 27 eQTL modules from a yeast eQTL data set, and estimate TF activity profiles using the method of Yu and Li (2005). Then likelihood ratio tests (LRTs) are conducted to find 760 relationships supporting the scenario of TF activity mediation: (DNA polymorphism --> cis-linked gene --> TF activity --> downstream linked gene). They are organized into 4 eQTL modules: an amino acid synthesis module featuring a cis-linked gene LEU2 and the mediating TF Leu3; a pheromone response module featuring a cis-linked gene GPA1 and the mediating TF Ste12; an energy-source control module featuring two cis-linked genes, GSY2 and HAP1, and the mediating TF Hap1; a mitotic exit module featuring four cis-linked genes, AMN1, CSH1, DEM1 and TOS1, and the mediating TF complex Ace2/Swi5. Gene Ontology is utilized to reveal interesting functional groups of the downstream genes in each module. AVAILABILITY Our methods are implemented in an R package: eqtl.TF, which includes source code and relevant data. It can be freely downloaded at http://www.stat.ucla.edu/~sunwei/software.htm. SUPPLEMENTARY INFORMATION http://www.stat.ucla.edu/~sunwei/yeast_eQTL_TF/supplementary.pdf.
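The likelihood ratio tests mentioned in this abstract compare nested models. As a generic illustration only, the sketch below tests whether a downstream gene's expression depends linearly on a TF activity profile, comparing a mean-only model against a simple linear regression under Gaussian errors. The function name and the choice of models are assumptions; the paper's actual tests concern the full mediation chain and are implemented in the authors' R package, which is not reproduced here.

```python
import math

def lrt_simple_regression(x, y):
    """Likelihood ratio test of y ~ 1 (null) against y ~ 1 + x (alternative).

    Under Gaussian errors the statistic is n * log(RSS0 / RSS1), which is
    asymptotically chi-square with 1 degree of freedom under the null.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    rss0 = sum((yi - mean_y) ** 2 for yi in y)   # residual sum of squares, null model
    rss1 = rss0 - sxy * sxy / sxx                # residual sum of squares, with slope
    statistic = n * math.log(rss0 / rss1)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

A small p-value suggests that adding the TF activity term improves the fit more than chance would. The sketch assumes the fit is not perfect (`rss1 > 0`) and that `x` is not constant.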
Affiliation(s)
- Wei Sun
- Department of Statistics, University of California at Los Angeles, Los Angeles, California, USA
|
318
|
Wu WS, Li WH, Chen BS. Identifying regulatory targets of cell cycle transcription factors using gene expression and ChIP-chip data. BMC Bioinformatics 2007; 8:188. [PMID: 17559637 PMCID: PMC1906835 DOI: 10.1186/1471-2105-8-188] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 10/21/2006] [Accepted: 06/08/2007] [Indexed: 11/27/2022]
Abstract
Background ChIP-chip data, which indicate binding of transcription factors (TFs) to DNA regions in vivo, are widely used to reconstruct transcriptional regulatory networks. However, the binding of a TF to a gene does not necessarily imply regulation. Thus, it is important to develop methods to identify regulatory targets of TFs from ChIP-chip data. Results We developed a method, called Temporal Relationship Identification Algorithm (TRIA), which uses gene expression data to identify a TF's regulatory targets among its binding targets inferred from ChIP-chip data. We applied TRIA to yeast cell cycle microarray data and identified many plausible regulatory targets of cell cycle TFs. We validated our predictions by checking the enrichments for functional annotation and known cell cycle genes. Moreover, we showed that TRIA performs better than two published methods (MA-Network and MFA). It is known that co-regulated genes may not be co-expressed. TRIA has the ability to identify subsets of highly co-expressed genes among the regulatory targets of a TF. Different functional roles are found for different subsets, indicating the diverse functions a TF could have. Finally, for a control, we showed that TRIA also performs well for cell-cycle irrelevant TFs. Conclusion Finding the regulatory targets of TFs is important for understanding how cells change their transcription program to adapt to environmental stimuli. Our algorithm TRIA is helpful for achieving this purpose.
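The idea of using the temporal relationship between a TF and its bound genes can be illustrated with a time-lagged correlation. The sketch below is not the TRIA algorithm itself, whose details are not given in this abstract; it is a minimal stand-in that assumes equally spaced time points and scores a candidate target by the lag at which its expression profile correlates most strongly with the TF profile. All function names and the `max_lag` parameter are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def best_lag(tf_profile, target_profile, max_lag=3):
    """Return (lag, correlation) with the largest absolute correlation.

    A lag of k compares the TF at time t with the target at time t + k,
    so a positive lag means the target follows the TF.
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        r = pearson(tf_profile[:len(tf_profile) - lag], target_profile[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A strongly negative correlation at the best lag would be read as repression under this simplified scheme; deciding significance would need a null distribution, which is left out here.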
Affiliation(s)
- Wei-Sheng Wu
- Lab of Control and Systems Biology, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 300, Taiwan
| | - Wen-Hsiung Li
- Department of Evolution and Ecology, University of Chicago, 1101 East 57th Street, Chicago, IL, 60637, USA
- Genomics Research Center, Academia Sinica, Taipei, Taiwan
| | - Bor-Sen Chen
- Lab of Control and Systems Biology, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 300, Taiwan
|
319
|
Rosenfeld S. Stochastic cooperativity in non-linear dynamics of genetic regulatory networks. Math Biosci 2007; 210:121-42. [PMID: 17617426 DOI: 10.1016/j.mbs.2007.05.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 12/19/2006] [Revised: 04/28/2007] [Accepted: 05/09/2007] [Indexed: 11/17/2022]
Abstract
Two major approaches are known in the field of stochastic dynamics of genetic regulatory networks (GRN). The first one, referred to here as the Markov Process Paradigm (MPP), places the focus of attention on the fact that many biochemical constituents vitally important for the network functionality are present only in small quantities within the cell, and therefore the regulatory process is essentially discrete and prone to relatively large fluctuations. The Master Equation of Markov Processes is an appropriate tool for the description of this kind of stochasticity. The second approach, the Non-linear Dynamics Paradigm (NDP), treats the regulatory process as essentially continuous. Deterministic differential equations are a natural tool for the description of such processes. According to NDP, stochasticity in such systems occurs due to possible bistability and oscillatory motion within the limit cycles. The goal of this paper is to outline a third scenario of stochasticity in the regulatory process. This scenario is only conceivable in high-dimensional, highly non-linear systems, and thus represents an adequate framework for conceptually modeling the GRN. We refer to this framework as the Stochastic Cooperativity Paradigm (SCP). In this approach, the focus of attention is placed on the fact that in systems with the size and link density of GRN (approximately 25,000 and approximately 100, respectively), the confluence of all the factors which are necessary for gene expression is a comparatively rare event, and only massive redundancy makes such events sufficiently frequent. An immediate consequence of this rareness is 'burstiness' in mRNA and protein concentrations, a well known effect in intracellular dynamics. We demonstrate that a high-dimensional non-linear system, despite the absence of explicit mechanisms for suppressing inherent instability, may nevertheless reside in a state of stationary pseudo-random fluctuations which for all practical purposes may be regarded as a stochastic process. This type of stochastic behavior is an inherent property of such systems and requires neither an external random force as in the Langevin approach, nor the discreteness of the process as in MPP, nor highly specialized conditions of bistability as in NDP, nor bifurcations with transition to chaos as in low-dimensional chaotic maps.
Affiliation(s)
- Simon Rosenfeld
- National Cancer Institute, EPN 3108, 6130 Executive Blvd., Bethesda, MD 20892, USA.
|
320
|
Hyduke DR, Jarboe LR, Tran LM, Chou KJY, Liao JC. Integrated network analysis identifies nitric oxide response networks and dihydroxyacid dehydratase as a crucial target in Escherichia coli. Proc Natl Acad Sci U S A 2007; 104:8484-9. [PMID: 17494765 PMCID: PMC1895976 DOI: 10.1073/pnas.0610888104] [Citation(s) in RCA: 114] [Impact Index Per Article: 6.7] [Received: 12/07/2006] [Indexed: 12/25/2022]
Abstract
Nitric oxide (NO) is used by mammalian immune systems to counter microbial invasions and is produced by bacteria during denitrification. As a defense, microorganisms possess a complex network to cope with NO. Here we report a combined transcriptomic, chemical, and phenotypic approach to identify direct NO targets and construct the biochemical response network. In particular, network component analysis was used to identify transcription factors that are perturbed by NO. Such information was screened with potential NO reaction mechanisms and phenotypic data from genetic knockouts to identify active chemistry and direct NO targets in Escherichia coli. This approach identified the comprehensive E. coli NO response network and evinced that NO halts bacterial growth via inhibition of the branched-chain amino acid biosynthesis enzyme dihydroxyacid dehydratase. Because mammals do not synthesize branched-chain amino acids, inhibition of dihydroxyacid dehydratase may have served to foster the role of NO in the immune arsenal.
Affiliation(s)
- Daniel R. Hyduke
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095
| | - Laura R. Jarboe
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095
| | - Linh M. Tran
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095
| | - Katherine J. Y. Chou
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095
| | - James C. Liao
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095
|
321
|
Brynildsen MP, Wu TY, Jang SS, Liao JC. Biological network mapping and source signal deduction. Bioinformatics 2007; 23:1783-91. [PMID: 17495996 DOI: 10.1093/bioinformatics/btm246] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION Many biological networks, including transcriptional regulation, metabolism, and the absorbance spectra of metabolite mixtures, can be represented in a bipartite fashion. Key to understanding these bipartite networks are the network architecture and governing source signals. Such information is often implicitly imbedded in the data. Here we develop a technique, network component mapping (NCM), to deduce bipartite network connectivity and regulatory signals from data without any need for prior information. RESULTS We demonstrate the utility of our approach by analyzing UV-vis spectra from mixtures of metabolites and gene expression data from Saccharomyces cerevisiae. From UV-vis spectra, hidden mixing networks and pure component spectra (sources) were deduced to a higher degree of resolution with our method than other current bipartite techniques. Analysis of S. cerevisiae gene expression from two separate environmental conditions (zinc and DTT treatment) yielded transcription networks consistent with ChIP-chip derived network connectivity. Due to the high degree of noise in gene expression data, the transcription network for many genes could not be inferred. However, with relatively clean expression data, our technique was able to deduce hidden transcription networks and instances of combinatorial regulation. These results suggest that NCM can deduce correct network connectivity from relatively accurate data. For noisy data, NCM yields the sparsest network capable of explaining the data. In addition, partial knowledge of the network topology can be incorporated into NCM as constraints. AVAILABILITY Algorithm available on request from the authors. Soon to be posted on the web, http://www.seas.ucla.edu/~liaoj/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mark P Brynildsen
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095, USA
|
322
|
Zhan M. Deciphering modular and dynamic behaviors of transcriptional networks. Genomic Med 2007; 1:19-28. [PMID: 18923925 DOI: 10.1007/s11568-007-9004-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 02/09/2007] [Accepted: 04/13/2007] [Indexed: 12/11/2022]
Abstract
The coordinated and dynamic modulation or interaction of genes or proteins acts as an important mechanism used by a cell in functional regulation. Recent studies have shown that many transcriptional networks exhibit a scale-free topology and hierarchical modular architecture. It has also been shown that transcriptional networks or pathways are dynamic and behave only in certain ways and controlled manners in response to disease development, changing cellular conditions, and different environmental factors. Moreover, evolutionarily conserved and divergent transcriptional modules underlie fundamental and species-specific molecular mechanisms controlling disease development or cellular phenotypes. Various computational algorithms have been developed to explore transcriptional networks and modules from gene expression data. In silico studies have also been made to mimic the dynamic behavior of regulatory networks, analyzing how disease or cellular phenotypes arise from the connectivity or networks of genes and their products. Here, we review the recent development in computational biology research on deciphering modular and dynamic behaviors of transcriptional networks, highlighting important findings. We also demonstrate how these computational algorithms can be applied in systems biology studies, such as those on disease, stem cells, and drug discovery.
Affiliation(s)
- Ming Zhan
- Bioinformatics Unit, Research Resources Branch, National Institute on Aging, NIH, 333 Cassell Drive, Baltimore, MD, 21224, USA
|
323
|
Abstract
Background In many approaches to the inference and modeling of regulatory interactions using microarray data, the expression of the gene coding for the transcription factor is considered to be an accurate surrogate for the true activity of the protein it produces. There are many instances where this is inaccurate due to post-translational modifications of the transcription factor protein. Inference of the activity of the transcription factor from the expression of its targets has predominantly involved linear models that do not reflect the nonlinear nature of transcription. We extend a recent approach to inferring the transcription factor activity based on nonlinear Michaelis-Menten kinetics of transcription from maximum likelihood to fully Bayesian inference and give an example of how the model can be further developed. Results We present results on synthetic and real microarray data. Additionally, we illustrate how gene and replicate specific delays can be incorporated into the model. Conclusion We demonstrate that full Bayesian inference is appropriate in this application and has several benefits over the maximum likelihood approach, especially when the volume of data is limited. We also show the benefits of using a non-linear model over a linear model, particularly in the case of repression.
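The nonlinear Michaelis-Menten model of transcription referred to in this abstract can be sketched as a simple rate equation for one target gene. The parametrisation below (basal rate `alpha`, sensitivity `beta`, half-saturation constant `gamma`, decay rate `delta`) and the repressor form are assumptions chosen for illustration, not necessarily the exact form used by the authors, and the forward-Euler integration stands in for whatever solver the paper uses. The sketch shows only the forward model; the Bayesian inference of the TF activity is not attempted here.

```python
def production_rate(f, beta, gamma, repressor=False):
    """Saturating production rate as a function of TF activity f (assumed form)."""
    if repressor:
        # Production falls from beta towards zero as the repressor activity grows.
        return beta * gamma / (gamma + f)
    # Production rises from zero and saturates at beta for an activator.
    return beta * f / (gamma + f)

def simulate_mrna(tf_activity, dt, alpha, beta, gamma, delta, x0=0.0, repressor=False):
    """Integrate dx/dt = alpha + production(f(t)) - delta * x with forward Euler.

    tf_activity holds one TF activity value per time step of length dt.
    Returns the mRNA trajectory, including the initial value.
    """
    x = x0
    trajectory = [x]
    for f in tf_activity:
        dxdt = alpha + production_rate(f, beta, gamma, repressor) - delta * x
        x += dt * dxdt
        trajectory.append(x)
    return trajectory
```

With a constant activity `f`, the trajectory settles at `(alpha + production_rate(f, beta, gamma)) / delta`, which gives a quick sanity check on any parameter choice.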
Affiliation(s)
- Simon Rogers
- Bioinformatics Research Centre, Department of Computing Science, University of Glasgow, Glasgow, UK
| | - Raya Khanin
- Department of Statistics, University of Glasgow, Glasgow, UK
| | - Mark Girolami
- Bioinformatics Research Centre, Department of Computing Science, University of Glasgow, Glasgow, UK
|
324
|
Chen G, Jensen ST, Stoeckert CJ. Clustering of genes into regulons using integrated modeling-COGRIM. Genome Biol 2007; 8:R4. [PMID: 17204163 PMCID: PMC1839128 DOI: 10.1186/gb-2007-8-1-r4] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 08/08/2006] [Revised: 11/14/2006] [Accepted: 01/04/2007] [Indexed: 11/12/2022]
Abstract
COGRIM, an implementation that integrates gene expression, ChIP binding and transcription factor motif data, is described and applied to both unicellular and mammalian organisms. We present a Bayesian hierarchical model and Gibbs Sampling implementation that integrates gene expression, ChIP binding, and transcription factor motif data in a principled and robust fashion. COGRIM was applied to both unicellular and mammalian organisms under different scenarios of available data. In these applications, we demonstrate the ability to predict gene-transcription factor interactions with reduced numbers of false-positive findings and to make predictions beyond what is obtained when single types of data are considered.
Affiliation(s)
- Guang Chen
- Department of Bioengineering, University of Pennsylvania, 240 Skirkanich Hall, 3320 Smith Walk, Philadelphia, Pennsylvania 19104, USA
- Center for Bioinformatics, University of Pennsylvania, 1420 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
| | - Shane T Jensen
- Department of Statistics, The Wharton School, University of Pennsylvania, 463 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, Pennsylvania 19104, USA
| | - Christian J Stoeckert
- Center for Bioinformatics, University of Pennsylvania, 1420 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
- Department of Genetics, School of Medicine, University of Pennsylvania, 415 Curie Boulevard, Philadelphia, Pennsylvania 19104, USA
|
325
|
Rahib L, MacLennan NK, Horvath S, Liao JC, Dipple KM. Glycerol kinase deficiency alters expression of genes involved in lipid metabolism, carbohydrate metabolism, and insulin signaling. Eur J Hum Genet 2007; 15:646-57. [PMID: 17406644 DOI: 10.1038/sj.ejhg.5201801] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Indexed: 11/09/2022]
Abstract
Glycerol kinase (GK) is at the interface of fat and carbohydrate metabolism and has been implicated in insulin resistance and type 2 diabetes mellitus. To define GK's role in insulin resistance, we examined gene expression in brown adipose tissue in a glycerol kinase knockout (KO) mouse model using microarray analysis. Global gene expression profiles of KO mice were distinct from wild type with 668 differentially expressed genes. These include genes involved in lipid metabolism, carbohydrate metabolism, insulin signaling, and insulin resistance. Real-time polymerase chain reaction analysis confirmed the differential expression of selected genes involved in lipid and carbohydrate metabolism. PathwayAssist analysis confirmed direct and indirect connections between glycerol kinase and genes in lipid metabolism, carbohydrate metabolism, insulin signaling, and insulin resistance. Network component analysis (NCA) showed that the transcription factors (TFs) PPAR-gamma, SREBP-1, SREBP-2, STAT3, STAT5, SP1, CEBPalpha, CREB, GR and PPAR-alpha have altered activity in the KO mice. NCA also revealed the individual contribution of these TFs on the expression of genes altered in the microarray data. This study elucidates the complex network of glycerol kinase and further confirms a possible role for glycerol kinase deficiency, a simple Mendelian disorder, in insulin resistance, and type 2 diabetes mellitus, a common complex genetic disorder.
Affiliation(s)
- Lola Rahib
- Biomedical Engineering, Interdepartmental Program, Henry Samueli School of Engineering and Applied Science at UCLA, Los Angeles, CA 90095, USA
|
326
|
Abstract
Classically, metabolism was investigated by studying molecular characteristics of enzymes and their regulators in isolation. This reductionistic approach successfully established mechanistic relationships with the immediate interacting neighbors and allowed reconstruction of network structures. Severely underdeveloped was the ability to make precise predictions about the integrated operation of pathways and networks that emerged from the typically nonlinear and complex interactions of proteins and metabolites. The burden of metabolic engineering is a consequence of this fact: one cannot yet predict with any certainty precisely what needs to be engineered to produce more complex phenotypes. What was and still is missing are concepts, methods, and algorithms to integrate data and information into a quantitatively coherent whole, as well as theoretical concepts to reliably predict the consequence of environmental stimuli or genetic interventions. This introduction and perspective to Domain 3, Metabolism and Metabolic Fluxes, starts with a brief overview of the panoply of global measurement technologies that herald the dawning of systems biology and whose impact on metabolic research is apparent throughout Domain 3. In the middle section, applications to Escherichia coli are used to illustrate general concepts and successes of computational methods that approach metabolism as a network of interacting elements, and thus have potential to fill the gap in quantitative data and information integration. The final section highlights prospective focus areas for future metabolic research, including functional genomics, elucidation of evolutionary principles, and the integration of metabolism with regulatory networks.
|
327
|
Sun J, Tuncay K, Haidar AA, Ensman L, Stanley F, Trelinski M, Ortoleva P. Transcriptional regulatory network discovery via multiple method integration: application to E. coli K12. Algorithms Mol Biol 2007; 2:2. [PMID: 17397539 PMCID: PMC1852316 DOI: 10.1186/1748-7188-2-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 08/02/2006] [Accepted: 03/30/2007] [Indexed: 11/17/2022]
Abstract
Transcriptional regulatory network (TRN) discovery from one method (e.g. microarray analysis, gene ontology, phylogenic similarity) does not seem feasible due to lack of sufficient information, resulting in the construction of spurious or incomplete TRNs. We develop a methodology, TRND, that integrates a preliminary TRN, microarray data, gene ontology and phylogenic similarity to accurately discover TRNs and apply the method to E. coli K12. The approach can easily be extended to include other methodologies. Although gene ontology and phylogenic similarity have been used in the context of gene-gene networks, we show that more information can be extracted when gene-gene scores are transformed to gene-transcription factor (TF) scores using a preliminary TRN. This seems to be preferable over the construction of gene-gene interaction networks in light of the observed fact that gene expression and the activity of a TF made of a component encoded by that gene are often out of phase. TRND multi-method integration is found to be facilitated by the use of a Bayesian framework for each method derived from its individual scoring measure and a training set of gene/TF regulatory interactions. The TRNs we construct are in better agreement with microarray data. The number of gene/TF interactions we discover is actually double that of existing networks.
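One way to picture how evidence from several methods might be combined in a Bayesian framework is through likelihood ratios. The sketch below is an assumption-laden illustration rather than the TRND procedure: it supposes that each method (microarray, gene ontology, phylogenic similarity) yields a likelihood ratio for a gene/TF pair, estimated from a training set, and that the methods are conditionally independent so the ratios multiply. The function name and the independence assumption are not taken from the paper.

```python
def combine_evidence(prior, likelihood_ratios):
    """Posterior probability of a gene/TF interaction from independent evidence.

    prior: prior probability that the pair interacts (strictly between 0 and 1).
    likelihood_ratios: one ratio P(score | interaction) / P(score | no interaction)
    per method, assumed conditionally independent (naive Bayes).
    """
    odds = prior / (1.0 - prior)      # convert the prior to odds
    for ratio in likelihood_ratios:
        odds *= ratio                 # each method scales the odds
    return odds / (1.0 + odds)        # convert back to a probability
```

For example, a prior of 0.1 with two methods giving ratios of 3 and 2 leads to posterior odds of (0.1 / 0.9) × 6, that is, a posterior probability of 0.4.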
Affiliation(s)
- Jingjun Sun
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
| | - Kagan Tuncay
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
| | - Alaa Abi Haidar
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
| | - Lisa Ensman
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
| | - Frank Stanley
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
| | - Michael Trelinski
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
| | - Peter Ortoleva
- Center for Cell and Virus Theory, Chemistry Building, Indiana University, Bloomington, IN 47405, USA
|
328
|
Wang J. A new framework for identifying combinatorial regulation of transcription factors: a case study of the yeast cell cycle. J Biomed Inform 2007; 40:707-25. [PMID: 17418646 DOI: 10.1016/j.jbi.2007.02.003] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 08/25/2006] [Revised: 12/23/2006] [Accepted: 02/27/2007] [Indexed: 01/24/2023]
Abstract
By integrating heterogeneous functional genomic datasets, we have developed a new framework for detecting combinatorial control of gene expression, which includes estimating transcription factor activities using a singular value decomposition method and reducing high-dimensional input gene space by considering genomic properties of gene clusters. The prediction of cooperative gene regulation is accomplished by either Gaussian Graphical Models or Pairwise Mixed Graphical Models. The proposed framework was tested on yeast cell cycle datasets: (1) 54 known yeast cell cycle genes with 9 cell cycle regulators and (2) 676 putative yeast cell cycle genes with 9 cell cycle regulators. The new framework gave promising results on inferring TF-TF and TF-gene interactions. It also revealed several interesting mechanisms such as negatively correlated protein-protein interactions and low affinity protein-DNA interactions that may be important during the yeast cell cycle. The new framework may easily be extended to study other higher eukaryotes.
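As a rough illustration of how a Gaussian Graphical Model separates direct from indirect associations, the sketch below computes partial correlations from the inverse covariance matrix. It is not the author's implementation: the function name `partial_correlations` and the simulated chain x -> y -> z are assumptions made purely for illustration.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse sample covariance.

    data: samples in rows, variables (e.g. TFs or genes) in columns.
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Hypothetical chain x -> y -> z: x and z are correlated only through y.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)
z = y + rng.normal(size=5000)
data = np.column_stack([x, y, z])

corr = np.corrcoef(data, rowvar=False)   # marginal correlations
pcor = partial_correlations(data)        # correlations given all other variables
```

In this toy setting the marginal correlation between x and z is substantial, while their partial correlation given y is close to zero, which is the kind of evidence a graphical model uses to omit an edge.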
Affiliation(s)
- Junbai Wang
- Department of Biological Sciences, Columbia University, 1212, Amsterdam Avenue, MC 2442, New York, NY 10027, USA.
|
329
|
Pournara I, Wernisch L. Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinformatics 2007; 8:61. [PMID: 17319944 PMCID: PMC1821042 DOI: 10.1186/1471-2105-8-61] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Received: 06/30/2006] [Accepted: 02/23/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes. RESULTS In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result. 
CONCLUSION Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
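The two-layer model described in this abstract corresponds to the standard factor analysis formulation, which can be written as follows. The symbols are assumptions chosen for illustration (x for the observed expression vector, f for the unobserved TF activities, Λ for the factor loadings); the paper's own notation may differ.

```latex
% Standard factor analysis model (notation assumed, not taken from the paper)
x = \Lambda f + \varepsilon, \qquad
f \sim \mathcal{N}(0, I_k), \qquad
\varepsilon \sim \mathcal{N}(0, \Psi), \quad \Psi \text{ diagonal}

% Implied covariance of the observed genes
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi

% Rotation invariance: for any orthogonal R,
(\Lambda R)(\Lambda R)^{\top} + \Psi = \Lambda \Lambda^{\top} + \Psi
```

The last identity shows why the loadings are only determined up to a rotation, and hence why classical FA can use rotation methods to select a sparse, more interpretable loadings matrix among equivalent solutions.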
Affiliation(s)
- Iosifina Pournara
- School of Crystallography, Birkbeck College, University of London, London, UK
- Lorenz Wernisch
- School of Crystallography, Birkbeck College, University of London, London, UK
|
330
|
Yang YL, Liao JC. Network component analysis of Saccharomyces cerevisiae stress response. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2004:2937-40. [PMID: 17270893 DOI: 10.1109/iembs.2004.1403834] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/13/2023]
Abstract
A method, network component analysis, was developed for uncovering hidden regulatory signals from outputs of networked systems, when only partial knowledge of the underlying network topology is available. This method was successfully applied to microarray data of the yeast Saccharomyces cerevisiae under various stress conditions. The activities of 96 transcription factors were determined, which differ significantly from their gene expression patterns.
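Network component analysis is usually formulated as a constrained matrix decomposition. The sketch below uses assumed symbols (E, A, P and their dimensions are illustrative) and is a summary of the general idea rather than the exact formulation used in this study.

```latex
% E: N genes x M conditions (log expression ratios)
% A: N genes x L TFs (connectivity strengths)
% P: L TFs x M conditions (TF activities)
E \approx A P, \qquad
A_{ij} = 0 \ \text{whenever TF } j \text{ is not known to regulate gene } i

% Estimation under the topology constraint Z_0 (the zero pattern of A)
\min_{A, P} \; \lVert E - A P \rVert_F^2
\quad \text{subject to} \quad A \in Z_0
```

The zero pattern supplied by the known topology is what distinguishes this decomposition from unconstrained methods such as principal component analysis.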
|
331
|
Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert JP. Classification of microarray data using gene networks. BMC Bioinformatics 2007; 8:35. [PMID: 17270037 PMCID: PMC1797191 DOI: 10.1186/1471-2105-8-35] [Citation(s) in RCA: 131] [Impact Index Per Article: 7.7] [Received: 07/30/2006] [Accepted: 02/01/2007] [Indexed: 11/18/2022] Open
Abstract
Background Microarrays have become extremely useful for analysing genetic phenomena, but establishing a relation between microarray analysis results (typically a list of genes) and their biological significance is often difficult. Currently, the standard approach is to map a posteriori the results onto gene networks in order to elucidate the functions perturbed at the level of pathways. However, integrating a priori knowledge of the gene networks could help in the statistical analysis of gene expression data and in their biological interpretation. Results We propose a method to integrate a priori the knowledge of a gene network in the analysis of gene expression data. The approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph, resulting in an attenuation of the high-frequency components of the expression profiles with respect to the topology of the graph. We show how to derive unsupervised and supervised classification algorithms of expression profiles, resulting in classifiers with biological relevance. We illustrate the method with the analysis of a set of expression profiles from irradiated and non-irradiated yeast strains. Conclusion Including a priori knowledge of a gene network for the analysis of gene expression data leads to good classification performance and improved interpretability of the results.
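The spectral idea in this abstract can be sketched in a few lines: project an expression profile onto the eigenvectors of the graph Laplacian and damp the components with large eigenvalues. The exponential attenuation, the `beta` parameter, the function names and the four-gene path graph below are assumptions chosen for illustration, not the authors' exact filter or data.

```python
import numpy as np

def laplacian(adjacency):
    """Combinatorial graph Laplacian L = D - A."""
    a = np.asarray(adjacency, dtype=float)
    return np.diag(a.sum(axis=1)) - a

def graph_lowpass(profile, adjacency, beta=1.0):
    """Attenuate high-frequency components of a per-gene profile on a graph."""
    eigvals, eigvecs = np.linalg.eigh(laplacian(adjacency))
    coeffs = eigvecs.T @ profile                  # graph Fourier coefficients
    filtered = np.exp(-beta * eigvals) * coeffs   # assumed exponential damping
    return eigvecs @ filtered

def roughness(profile, adjacency):
    """Quadratic form x^T L x: large when neighbouring genes disagree."""
    return float(profile @ laplacian(adjacency) @ profile)

# Hypothetical path graph of four genes with an alternating profile.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
profile = np.array([1.0, -1.0, 1.0, -1.0])
smoothed = graph_lowpass(profile, adjacency, beta=1.0)
```

The smoothed profile varies less between neighbouring genes than the original, which is what "attenuation of high-frequency components with respect to the topology of the graph" amounts to in this toy case.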
Affiliation(s)
- Franck Rapaport
- Institut Curie, Service de Bioinformatique, 26 rue d'Ulm, F-75248 Paris Cedex 05, France
- Ecole des Mines de Paris, Centre for Computational Biology, 35 rue Saint-Honoré, 77300 Fontainebleau, France
- Andrei Zinovyev
- Institut Curie, Service de Bioinformatique, 26 rue d'Ulm, F-75248 Paris Cedex 05, France
- Marie Dutreix
- Institut Curie, CNRS-UMR 2027, Bâtiment 110, Centre Universitaire, F-91405 Orsay, France
- Emmanuel Barillot
- Institut Curie, Service de Bioinformatique, 26 rue d'Ulm, F-75248 Paris Cedex 05, France
- Jean-Philippe Vert
- Ecole des Mines de Paris, Centre for Computational Biology, 35 rue Saint-Honoré, 77300 Fontainebleau, France
|
332
|
Martin S, Zhang Z, Martino A, Faulon JL. Boolean dynamics of genetic regulatory networks inferred from microarray time series data. Bioinformatics 2007; 23:866-74. [PMID: 17267426 DOI: 10.1093/bioinformatics/btm021] [Citation(s) in RCA: 124] [Impact Index Per Article: 7.3] [Indexed: 11/12/2022]
Abstract
MOTIVATION Methods available for the inference of genetic regulatory networks strive to produce a single network, usually by optimizing some quantity to fit the experimental observations. In this article we investigate the possibility that multiple networks can be inferred, all resulting in similar dynamics. This idea is motivated by theoretical work which suggests that biological networks are robust and adaptable to change, and that the overall behavior of a genetic regulatory network might be captured in terms of dynamical basins of attraction. RESULTS We have developed and implemented a method for inferring genetic regulatory networks from time series microarray data. Our method first clusters and discretizes the gene expression data using k-means and support vector regression. We then enumerate Boolean activation-inhibition networks to match the discretized data. Finally, the dynamics of the Boolean networks are examined. We have tested our method on two immunology microarray datasets: an IL-2-stimulated T cell response dataset and an LPS-stimulated macrophage response dataset. In both cases, we discovered that many networks matched the data, and that most of these networks had similar dynamics. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
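The enumeration step can be illustrated with a brute-force sketch for a tiny system. The update rule used here (a node switches on when its net signed input is positive, off when negative, and keeps its state when the input is zero) is an assumption borrowed from common threshold Boolean models, and the two-node time series is invented; neither is taken from the paper.

```python
from itertools import product

def step(state, weights):
    """One synchronous update of an activation-inhibition network.

    weights[i][j] is +1 (j activates i), -1 (j inhibits i) or 0 (no edge).
    """
    nxt = []
    for i in range(len(state)):
        total = sum(w * s for w, s in zip(weights[i], state))
        if total > 0:
            nxt.append(1)
        elif total < 0:
            nxt.append(0)
        else:
            nxt.append(state[i])  # no net input: keep current state
    return tuple(nxt)

def matching_networks(series):
    """Enumerate all signed networks that reproduce a discretized series."""
    n = len(series[0])
    matches = []
    for flat in product((-1, 0, 1), repeat=n * n):
        weights = [flat[i * n:(i + 1) * n] for i in range(n)]
        if all(step(series[t], weights) == series[t + 1]
               for t in range(len(series) - 1)):
            matches.append(weights)
    return matches

# Hypothetical discretized time series for two gene clusters.
series = [(1, 0), (1, 1), (0, 1)]
networks = matching_networks(series)
```

For this toy series three distinct networks reproduce the data, differing only in the self-interaction of the second node, which mirrors the abstract's observation that many networks can match the same observations. The search space grows as 3^(n^2), so exhaustive enumeration is only practical after clustering has reduced the number of nodes.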
Affiliation(s)
- Shawn Martin
- Sandia National Laboratories, Computational Biology Department, PO Box 5800, Albuquerque, NM 87185-1316, USA
|
333
|
Transcriptional regulatory network refinement and quantification through kinetic modeling, gene expression microarray data and information theory. BMC Bioinformatics 2007; 8:20. [PMID: 17244365 PMCID: PMC1790715 DOI: 10.1186/1471-2105-8-20] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 07/12/2006] [Accepted: 01/23/2007] [Indexed: 11/10/2022] Open
Abstract
Background Gene expression microarray and other multiplex data hold promise for addressing the challenges of cellular complexity, refined diagnoses and the discovery of well-targeted treatments. A new approach to the construction and quantification of transcriptional regulatory networks (TRNs) is presented that integrates gene expression microarray data and cell modeling through information theory. Given a partial TRN and time series data, a probability density is constructed that is a functional of the time course of transcription factor (TF) thermodynamic activities at the site of gene control, and is a function of mRNA degradation and transcription rate coefficients, and equilibrium constants for TF/gene binding. Results Our approach yields more physicochemical information that complements the results of network structure delineation methods, and thereby can serve as an element of a comprehensive TRN discovery/quantification system. The most probable TF time courses and values of the aforementioned parameters are obtained by maximizing the probability obtained through entropy maximization. Observed time delays between mRNA expression and activity are accounted for implicitly since the time course of the activity of a TF is coupled by probability functional maximization, and is not assumed to be proportional to expression level of the mRNA type that translates into the TF. This allows one to investigate post-translational and TF activation mechanisms of gene regulation. Accuracy and robustness of the method are evaluated. A kinetic formulation is used to facilitate the analysis of phenomena with a strongly dynamical character while a physically-motivated regularization of the TF time course is found to overcome difficulties due to omnipresent noise and data sparsity that plague other methods of gene expression data analysis. An application to Escherichia coli is presented.
Conclusion Multiplex time series data can be used for the construction of the network of cellular processes and the calibration of the associated physicochemical parameters. We have demonstrated these concepts in the context of gene regulation understood through the analysis of gene expression microarray time series data. Casting the approach in a probabilistic framework has allowed us to address the uncertainties in gene expression microarray data. Our approach was found to be robust to error in the gene expression microarray data and mistakes in a proposed TRN.
|
334
|
Bioinformatics analysis of the early inflammatory response in a rat thermal injury model. BMC Bioinformatics 2007; 8:10. [PMID: 17214898 PMCID: PMC1797813 DOI: 10.1186/1471-2105-8-10] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 07/08/2006] [Accepted: 01/10/2007] [Indexed: 12/25/2022] Open
Abstract
Background Thermal injury is among the most severe forms of trauma and its effects are both local and systemic. Response to thermal injury includes cellular protection mechanisms, inflammation, hypermetabolism, prolonged catabolism, organ dysfunction and immuno-suppression. It has been hypothesized that gene expression patterns in the liver will change with severe burns, thus reflecting the role the liver plays in the response to burn injury. Characterizing the molecular fingerprint (i.e., expression profile) of the inflammatory response resulting from burns may help elucidate the activated mechanisms and suggest new therapeutic intervention. In this paper we propose a novel integrated framework for analyzing time-series transcriptional data, with emphasis on the burn-induced response within the context of the rat animal model. Our analysis robustly identifies critical expression motifs, indicative of the dynamic evolution of the inflammatory response, and we further propose a putative reconstruction of the associated transcription factor activities. Results Implementation of our algorithm on data obtained from an animal (rat) burn injury study identified 281 genes corresponding to 4 unique profiles. Enrichment evaluation upon both gene ontologies and transcription factors verifies the inflammation-specific character of the selections and the rationalization of the burn-induced inflammatory response. Conducting the transcription network reconstruction and analysis, we have identified transcription factors, including AHR, Octamer Binding Proteins, Kruppel-like Factors, and cell cycle regulators, as being highly important to an organism's response to burn injury. These transcription factors are notable due to their roles in pathways that play a part in the gross physiological response to burn such as changes in the immune response and inflammation.
Conclusion Our results indicate that our novel selection/classification algorithm has been successful in selecting genes that play an important role in thermal injury. Additionally, we have demonstrated the value of an integrative approach in identifying possible points of intervention, namely the activation of certain transcription factors that govern the organism's response.
|
335
|
Kapil A, Gudi RD, Noronha SB. Gene expression profile analysis using discrimination and fuzzy classification methods. ASIA-PAC J CHEM ENG 2007. [DOI: 10.1002/apj.12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
|
336
|
Li H, Sun Y, Zhan M. The discovery of transcriptional modules by a two-stage matrix decomposition approach. Bioinformatics 2006; 23:473-9. [PMID: 17189296 DOI: 10.1093/bioinformatics/btl640] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION We address the problem of identifying gene transcriptional modules from gene expression data by proposing a new approach. Genes mostly interact with each other to form transcriptional modules for context-specific cellular activities or functions. Unraveling such transcriptional modules is important for understanding biological networks, deciphering regulatory mechanisms and identifying biomarkers. METHOD The proposed algorithm is based on two-stage matrix decomposition. We first model microarray data as non-linear mixtures and adopt non-linear independent component analysis to reduce the non-linear distortion and separate the data into independent latent components. We then apply the probabilistic sparse matrix decomposition approach to model the 'hidden' expression profiles of genes across the independent latent components as linear weighted combinations of a small number of transcriptional regulator profiles. Finally, we propose a general scheme for identifying gene modules from the outcomes of the matrix decomposition. RESULTS The proposed algorithm partitions genes into non-mutually exclusive transcriptional modules, independent from expression profile similarity measurement. The modules contain genes with not only similar but also different expression patterns, and show the highest enrichment of biological functions in comparison with those obtained by other methods. The usefulness of the algorithm was validated by a yeast microarray data analysis. AVAILABILITY The software is available upon request to the authors.
Affiliation(s)
- Huai Li
- Bioinformatics Unit, Branch of Research Resources, National Institute on Aging, NIH, Baltimore, MD 21224, USA
|
337
|
De Keersmaecker SCJ, Thijs IMV, Vanderleyden J, Marchal K. Integration of omics data: how well does it work for bacteria? Mol Microbiol 2006; 62:1239-50. [PMID: 17040488 DOI: 10.1111/j.1365-2958.2006.05453.x] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.5] [Indexed: 12/21/2022]
Abstract
In the current omics era, innovative high-throughput technologies allow measuring temporal and conditional changes at various cellular levels. Although individual analysis of each of these omics data undoubtedly results in interesting findings, it is only by integrating them that a global insight into cellular behaviour can be gained. A systems approach thus is predicated on data integration. However, because of the complexity of biological systems and the specificities of the data-generating technologies (noisiness, heterogeneity, etc.), integrating omics data in an attempt to reconstruct signalling networks is not trivial. Developing the corresponding methodologies constitutes a major research challenge. Besides their intrinsic value for health care, environment and industry, prokaryotes are ideal model systems to further develop these methods because of their lower regulatory complexity compared with eukaryotes, and the ease with which they can be manipulated. Several successful examples outlined in this review already show the potential of the systems approach for both fundamental and industrial applications, which would be time-consuming or impossible to develop solely through traditional reductionist approaches.
Affiliation(s)
- Sigrid C J De Keersmaecker
- Centre of Microbial and Plant Genetics (CMPG) Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, Belgium
|
338
|
Abstract
MOTIVATION Global gene expression measurements as obtained, for example, in microarray experiments can provide important clues to the underlying transcriptional control mechanisms and network structure of a biological cell. In the absence of a detailed understanding of this gene regulation, current attempts at classification of expression data rely on clustering and pattern recognition techniques employing ad-hoc similarity criteria. To improve this situation, a better understanding of the expected relationships between expression profiles of genes associated by biological function is required. RESULTS It is shown that perturbation expansions familiar from biological systems theory make precise predictions for the types of relationships to be expected for expression profiles of biologically associated genes, even if the underlying biological factors responsible for this association are not known. Classification criteria are derived, most of which are not usually employed in clustering algorithms. The approach is illustrated by using the AtGenExpress Arabidopsis thaliana developmental expression map.
Affiliation(s)
- Andreas W Schreiber
- Australian Centre for Plant Functional Genomics, Hartley Grove, PMB 1 Waite Campus, The University of Adelaide Glen Osmond 5064, Australia.
|
339
|
Raab RM. Incorporating genome-scale tools for studying energy homeostasis. Nutr Metab (Lond) 2006; 3:40. [PMID: 17081308 PMCID: PMC1636640 DOI: 10.1186/1743-7075-3-40] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/14/2006] [Accepted: 11/03/2006] [Indexed: 11/16/2022] Open
Abstract
Mammals have evolved complex regulatory systems that enable them to maintain energy homeostasis despite constant environmental challenges that limit the availability of energy inputs and their composition. Biological control relies upon intricate systems composed of multiple organs and specialized cell types that regulate energy uptake, storage, and expenditure. Because these systems simultaneously perform diverse functions and are highly integrated, they are extremely difficult to understand in terms of their individual component contributions to energy homeostasis. In order to provide improved treatments and clinical options, it is important to identify the principal genetic and molecular components, as well as the systemic features of regulation. To begin, many of these features can be discovered by integrating experimental technologies with advanced methods of analysis. This review focuses on the analysis of transcriptional data derived from microarrays and how it can complement other experimental techniques to study energy homeostasis.
|
340
|
Abstract
New technologies are permitting large-scale quantitative studies of signal-transduction networks. Such data are hard to understand completely by inspection and intuition. 'Data-driven models' help users to analyse large data sets by simplifying the measurements themselves. Data-driven modelling approaches such as clustering, principal components analysis and partial least squares can derive biological insights from large-scale experiments. These models are emerging as standard tools for systems-level research in signalling networks.
Affiliation(s)
- Kevin A Janes
- Cell Decision Processes Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
|
341
|
Brynildsen MP, Tran LM, Liao JC. A Gibbs sampler for the identification of gene expression and network connectivity consistency. Bioinformatics 2006; 22:3040-6. [PMID: 17060361 DOI: 10.1093/bioinformatics/btl541] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION Data from DNA microarrays and ChIP-chip binding assays often form the basis of transcriptional regulatory analyses. However, experimental noise in both data types, combined with environmental dependence and the lack of correlation between binding and regulation in ChIP-chip binding data, complicates analyses that utilize these complementary data sources. Therefore, to minimize the impact of these inaccuracies on transcription analyses it is desirable to identify instances of gene expression-ChIP-chip agreement, under the premise that inaccuracies are less likely to be present when separate data sources corroborate each other. Current methods for such identification either make key assumptions that limit their applicability and/or yield high false positive and false negative rates. The goal of this work was to develop a method with a minimal number of assumptions, and thus widely applicable, that can identify agreement between gene expression and ChIP-chip data at a higher confidence level than current methods. RESULTS We demonstrate in Saccharomyces cerevisiae that currently available ChIP-chip binding data explain microarray data from a variety of environments only as well as randomized networks with the same connectivity density. This suggests a high degree of inconsistency between the two data types and illustrates the need for a method that can identify consistency between the two data sources. Here we have developed a Gibbs sampling technique to identify genes whose expression and ChIP-chip binding data are mutually consistent. Compared to current methods that could perform the same task, the Gibbs sampling method developed here exceeds their ability at high levels (>50%) of transcription network and gene expression error, while performing similarly at lower levels.
Using this technique, we show that on average 73% more gene expression features can be captured per gene as compared to the unfiltered use of gene expression and ChIP-chip-derived network connectivity data. It is important to note that the method described here can be generalized to other transcription connectivity data (e.g. sequence analysis). AVAILABILITY Our algorithm is available on request from the authors and soon to be posted on the web. See author's homepage for details, http://www.seas.ucla.edu/~liaoj/
Affiliation(s)
- Mark P Brynildsen
- Department of Chemical and Biomolecular Engineering, University of California Los Angeles, CA 90095, USA
|
342
|
Wu WS, Li WH, Chen BS. Computational reconstruction of transcriptional regulatory modules of the yeast cell cycle. BMC Bioinformatics 2006; 7:421. [PMID: 17010188 PMCID: PMC1637117 DOI: 10.1186/1471-2105-7-421] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.8] [Received: 05/04/2006] [Accepted: 09/29/2006] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND A transcriptional regulatory module (TRM) is a set of genes that is regulated by a common set of transcription factors (TFs). By organizing the genome into TRMs, a living cell can coordinate the activities of many genes and carry out complex functions. Therefore, identifying TRMs is helpful for understanding gene regulation. RESULTS Integrating gene expression and ChIP-chip data, we develop a method, called MOdule Finding Algorithm (MOFA), for reconstructing TRMs of the yeast cell cycle. MOFA identified 87 TRMs, which together contain 336 distinct genes regulated by 40 TFs. Using various kinds of data, we validated the biological relevance of the identified TRMs. Our analysis shows that different combinations of a fairly small number of TFs are responsible for regulating a large number of genes involved in different cell cycle phases and that there may exist crosstalk between the cell cycle and other cellular processes. MOFA is capable of finding many novel TF-target gene relationships and can determine whether a TF is an activator and/or a repressor. Finally, MOFA refines some clusters proposed by previous studies and provides a better understanding of how the complex expression program of the cell cycle is regulated. CONCLUSION MOFA was developed to reconstruct TRMs of the yeast cell cycle. Many of these TRMs are in agreement with previous studies. Further, MOFA inferred many interesting modules and novel TF combinations. We believe that computational analysis of multiple types of data will be a powerful approach to studying complex biological systems when more and more genomic resources such as genome-wide protein activity data and protein-protein interaction data become available.
Affiliation(s)
- Wei-Sheng Wu
- Lab of Control and Systems Biology, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 300, Taiwan
- Wen-Hsiung Li
- Department of Evolution and Ecology, University of Chicago, 1101 East 57th Street, Chicago, IL, 60637, USA
- Genomics Research Center, Academia Sinica, Taipei, Taiwan
- Bor-Sen Chen
- Lab of Control and Systems Biology, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 300, Taiwan
|
343
|
Sanguinetti G, Lawrence ND, Rattray M. Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities. Bioinformatics 2006; 22:2775-81. [PMID: 16966362 DOI: 10.1093/bioinformatics/btl473] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.2] [Indexed: 11/14/2022]
Abstract
MOTIVATION Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. Recent experimental high-throughput techniques, such as Chromatin Immunoprecipitation (ChIP), provide important information about the architecture of the regulatory networks in the cell. However, it is very difficult to measure the concentration levels of transcription factor proteins and determine their regulatory effect on gene transcription. It is therefore an important computational challenge to infer these quantities using gene expression data and network architecture data. RESULTS We develop a probabilistic state space model that allows genome-wide inference of both transcription factor protein concentrations and their effect on the transcription rates of each target gene from microarray data. We use variational inference techniques to learn the model parameters and perform posterior inference of protein concentrations and regulatory strengths. The probabilistic nature of the model also means that we can associate credibility intervals with our estimates, as well as providing a tool to detect which binding events lead to significant regulation. We demonstrate our model on artificial data and on two yeast datasets in which the network structure has previously been obtained using ChIP data. Predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell. AVAILABILITY MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma
Affiliation(s)
- Guido Sanguinetti
- Department of Computer Science, Regent Court 211 Portobello Road, Sheffield, S1 4DP, UK.
|
344
|
Cokus S, Rose S, Haynor D, Grønbech-Jensen N, Pellegrini M. Modelling the network of cell cycle transcription factors in the yeast Saccharomyces cerevisiae. BMC Bioinformatics 2006; 7:381. [PMID: 16914048 PMCID: PMC1570153 DOI: 10.1186/1471-2105-7-381] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Received: 03/28/2006] [Accepted: 08/16/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Reverse-engineering regulatory networks is one of the central challenges for computational biology. Many techniques have been developed to accomplish this by utilizing transcription factor binding data in conjunction with expression data. Of these approaches, several have focused on the reconstruction of the cell cycle regulatory network of Saccharomyces cerevisiae. The emphasis of these studies has been to model the relationships between transcription factors and their target genes. In contrast, here we focus on reverse-engineering the network of relationships among transcription factors that regulate the cell cycle in S. cerevisiae. RESULTS We have developed a technique to reverse-engineer networks of the time-dependent activities of transcription factors that regulate the cell cycle in S. cerevisiae. The model utilizes linear regression to first estimate the activities of transcription factors from expression time series and genome-wide transcription factor binding data. We then use least squares to construct a model of the time evolution of the activities. We validate our approach in two ways: by demonstrating that it accurately models expression data and by demonstrating that our reconstructed model is similar to previously-published models of transcriptional regulation of the cell cycle. CONCLUSION Our regression-based approach allows us to build a general model of transcriptional regulation of the yeast cell cycle that includes additional factors and couplings not reported in previously-published models. Our model could serve as a starting point for targeted experiments that test the predicted interactions. In the future, we plan to apply our technique to reverse-engineer other systems where both genome-wide time series expression data and transcription factor binding data are available.
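The two-stage regression described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy binding matrix, and the first-order linear form a(t+1) ≈ M a(t) for the time evolution are all assumptions made for the example, and the data are synthetic and noise-free.

```python
import numpy as np

def estimate_activities(expression, binding):
    """Stage 1: solve expression (genes x times) ~ binding (genes x TFs) @ activities."""
    activities, *_ = np.linalg.lstsq(binding, expression, rcond=None)
    return activities  # shape: TFs x times

def fit_time_evolution(activities):
    """Stage 2: least-squares fit of M in a(t+1) ~ M @ a(t) (assumed linear form)."""
    past, future = activities[:, :-1], activities[:, 1:]
    # Transpose so the unknown M.T sits on the right-hand side of lstsq.
    m_transposed, *_ = np.linalg.lstsq(past.T, future.T, rcond=None)
    return m_transposed.T

# Hypothetical toy network: 12 genes, 3 transcription factors, 12 time points.
n_genes, n_tfs, n_times = 12, 3, 12
binding = np.zeros((n_genes, n_tfs))
for g in range(n_genes):
    binding[g, g % n_tfs] = 1.0            # every gene is bound by one TF
    if g % 2 == 0:
        binding[g, (g + 1) % n_tfs] = 1.0  # even-numbered genes by a second TF

# Synthetic activities generated by a known coupling matrix.
true_m = np.array([[0.9, -0.3, 0.0],
                   [0.3,  0.9, 0.0],
                   [0.0,  0.2, 0.5]])
true_act = np.empty((n_tfs, n_times))
true_act[:, 0] = [1.0, 0.0, 0.5]
for t in range(1, n_times):
    true_act[:, t] = true_m @ true_act[:, t - 1]

expression = binding @ true_act            # noise-free expression time series
est_act = estimate_activities(expression, binding)
est_m = fit_time_evolution(est_act)
```

With real data the expression matrix is noisy and the binding matrix comes from genome-wide binding measurements, so both fits would be approximate rather than exact as they are here.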
Affiliation(s)
- Shawn Cokus
- Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, USA
- Sherri Rose
- Department of Biostatistics, University of California, Berkeley, CA, USA
- David Haynor
- Department of Radiology, University of Washington, WA, USA
- Matteo Pellegrini
- Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, USA
|
345
|
Abstract
Machine learning offers a principled approach for developing sophisticated, automatic, and objective algorithms for analysis of high-dimensional and multimodal biomedical data. This review focuses on several advances in the state of the art that have shown promise in improving detection, diagnosis, and therapeutic monitoring of disease. Key in the advancement has been the development of a more in-depth understanding and theoretical analysis of critical issues related to algorithmic construction and learning theory. These include trade-offs for maximizing generalization performance, use of physically realistic constraints, and incorporation of prior knowledge and uncertainty. The review describes recent developments in machine learning, focusing on supervised and unsupervised linear methods and Bayesian inference, which have made significant impacts in the detection and diagnosis of disease in biomedicine. We describe the different methodologies and, for each, provide examples of their application to specific domains in biomedical diagnostics.
Affiliation(s)
- Paul Sajda
- Department of Biomedical Engineering, Columbia University, New York, NY 10027, USA.
|
346
|
Brynildsen MP, Tran LM, Liao JC. Versatility and connectivity efficiency of bipartite transcription networks. Biophys J 2006; 91:2749-59. [PMID: 16815895 PMCID: PMC1578464 DOI: 10.1529/biophysj.106.082560] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The modulation of promoter activity by DNA-binding transcription regulators forms a bipartite network between the regulators and genes, in which a smaller number of regulators control a much larger number of genes. To facilitate representation of gene expression data with the simplest possible network structure, we have characterized the ability of bipartite networks to describe data. This has led to the classification of two types of bipartite networks, versatile and nonversatile. Versatile networks can describe any data of the same rank, and are indistinguishable from one another. Nonversatile networks require constraints to be present in data they describe, which may be used to distinguish between different network topologies. By quantifying the ability of bipartite networks to represent data we were able to define connectivity efficiency, which is a measure of how economic the use of connections is within a network with respect to data representation and generation. We postulated that it may be desirable for an organism to maximize its gene expression range per network edge, since development of a regulatory connection may have some evolutionary cost. We found that the transcriptional regulatory networks of both Saccharomyces cerevisiae and Escherichia coli lie close to their respective connectivity efficiency maxima, suggesting that connectivity efficiency may have some evolutionary influence.
Affiliation(s)
- Mark P Brynildsen
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California, USA
|
347
|
Galbraith SJ, Tran LM, Liao JC. Transcriptome network component analysis with limited microarray data. Bioinformatics 2006; 22:1886-94. [PMID: 16766556 DOI: 10.1093/bioinformatics/btl279] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
UNLABELLED Network component analysis (NCA) is a method to deduce transcription factor (TF) activities and TF-gene regulation control strengths from gene expression data and a TF-gene binding connectivity network. Previously, this method could analyze a maximum number of regulators equal to the total sample size because of the identifiability limit in data decomposition. As such, the total number of source signal components was limited to the total number of experiments rather than the total number of biological regulators. However, networks that have fewer transcriptome data points than the number of regulators are of interest. Thus it is imperative to develop a theoretical basis that allows realistic source signal extraction based on relatively few data points. On the other hand, such methods would inherently increase numerical challenges leading to multiple solutions. Therefore, solutions to both problems are needed. RESULTS We have improved NCA for transcription factor activity (TFA) estimation, based on the observation that most genes are regulated by only a few TFs. This observation leads to the derivation of a new identifiability criterion, which is tested during numerical iteration and allows us to decompose data when the number of TFs is greater than the number of experiments. To show that our method works with real microarray data and has biological utility, we analyze Saccharomyces cerevisiae cell cycle microarray data (73 experiments) using a TF-gene connectivity network (96 TFs) derived from ChIP-chip binding data. We compare the results of NCA analysis with the results obtained from ChIP-chip regression methods, and we show that NCA and regression produce TFAs that are qualitatively similar, but the NCA TFAs outperform regression in statistical tests. We also show that NCA can extract subtle TFA signals that correlate with known cell cycle TF function and cell cycle phase.
Overall we determined that 31 TFs have statistically periodic TFAs in one or more experiments, 75% of which are known cell cycle regulators. In addition, we find that the 12 TFAs that are periodic in two or more experiments correspond to well-known cell cycle regulators. We also investigated TFA sensitivity to the choice of connectivity network: we constructed two networks using different ChIP-chip p-value cut-offs. AVAILABILITY The NCA Toolbox for MATLAB is available at http://www.seas.ucla.edu/~liaoj/download.htm.
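The decomposition that NCA performs, expression ≈ control strengths × TF activities with the strengths confined to a known connectivity pattern, can be illustrated with a simple alternating least-squares loop. This is a hedged sketch only: the function name `nca_als`, the toy mask, and the synthetic data are assumptions, and the loop does not implement the new identifiability criterion or the case of more TFs than experiments described in the abstract.

```python
import numpy as np

def nca_als(expression, mask, n_iter=50, seed=0):
    """Alternate least-squares updates of strengths (genes x TFs) and activities (TFs x samples).

    Entries of `strengths` where mask == 0 are never updated and stay zero.
    """
    rng = np.random.default_rng(seed)
    strengths = mask * rng.uniform(0.5, 1.5, size=mask.shape)
    errors = []
    for _ in range(n_iter):
        # Fix strengths, solve for activities.
        activities, *_ = np.linalg.lstsq(strengths, expression, rcond=None)
        errors.append(np.linalg.norm(expression - strengths @ activities))
        # Fix activities, refit each gene's strengths on its allowed TFs only.
        for g in range(mask.shape[0]):
            idx = np.flatnonzero(mask[g])
            if idx.size:
                sol, *_ = np.linalg.lstsq(activities[idx].T, expression[g], rcond=None)
                strengths[g, idx] = sol
    # Final activities consistent with the last strengths update.
    activities, *_ = np.linalg.lstsq(strengths, expression, rcond=None)
    errors.append(np.linalg.norm(expression - strengths @ activities))
    return strengths, activities, errors

# Hypothetical connectivity: 6 genes, 2 TFs, each gene regulated by one or two TFs.
mask = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])
rng = np.random.default_rng(1)
true_strengths = mask * rng.uniform(0.5, 2.0, size=mask.shape)
true_activities = rng.normal(size=(2, 5))      # 5 synthetic experiments
expression = true_strengths @ true_activities

strengths, activities, errors = nca_als(expression, mask)
```

Each half-step exactly minimizes the residual over one block of unknowns, so the recorded error cannot increase between iterations. The recovered strengths and activities are determined only up to a per-TF scaling, so they should not be compared entry by entry with the generating values.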
Affiliation(s)
- Simon J Galbraith
- Department of Computer Science, University of California Los Angeles, CA, USA
|
348
|
|
349
|
Wendisch VF, Bott M, Kalinowski J, Oldiges M, Wiechert W. Emerging Corynebacterium glutamicum systems biology. J Biotechnol 2006; 124:74-92. [PMID: 16406159 DOI: 10.1016/j.jbiotec.2005.12.002] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2005] [Revised: 10/12/2005] [Accepted: 12/01/2005] [Indexed: 10/25/2022]
Abstract
Corynebacterium glutamicum is widely used for the biotechnological production of amino acids. Amino acid producing strains have been improved classically by mutagenesis and screening as well as in a rational manner using recombinant DNA technology. Metabolic flux analysis may be viewed as the first systems approach to C. glutamicum physiology since it combines isotope labeling data with metabolic network models of the biosynthetic and central metabolic pathways. However, only the complete genome sequence of C. glutamicum and post-genomics methods such as transcriptomics and proteomics have allowed characterizing metabolic and regulatory properties of this bacterium on a truly global level. Besides transcriptomics and proteomics, metabolomics and modeling approaches have now been established. Systems biology, which uses systematic genomic, proteomic and metabolomic technologies with the final aim of constructing comprehensive and predictive models of complex biological systems, is emerging for C. glutamicum. We will present current developments that advanced our insight into fundamental biology of C. glutamicum and that in the future will enable novel biotechnological applications for the improvement of amino acid production.
|
350
|
Tuck DP, Kluger HM, Kluger Y. Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics 2006; 7:236. [PMID: 16670008 PMCID: PMC1482723 DOI: 10.1186/1471-2105-7-236] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2005] [Accepted: 05/02/2006] [Indexed: 11/20/2022] Open
Abstract
Background High throughput gene expression experiments yield large amounts of data that can augment our understanding of disease processes, in addition to classifying samples. Here we present new paradigms of data separation based on construction of transcriptional regulatory networks for normal and abnormal cells using sequence predictions, literature based data and gene expression studies. We analyzed expression datasets from a number of diseased and normal cells, including different types of acute leukemia, and breast cancer with variable clinical outcome. Results We constructed sample-specific regulatory networks to identify links between transcription factors (TFs) and regulated genes that differentiate between healthy and diseased states. This approach carries the advantage of identifying key transcription factor-gene pairs with differential activity between healthy and diseased states rather than merely using gene expression profiles, thus alluding to processes that may be involved in gene deregulation. We then generalized this approach by studying simultaneous changes in functionality of multiple regulatory links pointing to a regulated gene or emanating from one TF (or changes in gene centrality defined by its in-degree or out-degree measures, respectively). We found that samples can often be separated based on these measures of gene centrality more robustly than using individual links. We examined distributions of distances (the number of links needed to traverse the path between each pair of genes) in the transcriptional networks for gene subsets whose collective expression profiles could best separate each dataset into predefined groups. We found that genes that optimally classify samples are concentrated in neighborhoods in the gene regulatory networks. This suggests that genes that are deregulated in diseased states exhibit a remarkable degree of connectivity.
Conclusion Transcription factor-regulated gene links and centrality of genes on transcriptional networks can be used to differentiate between cell types. Transcriptional network blueprints can be used as a basis for further research into gene deregulation in diseased states.
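The centrality and distance measures named in this abstract can be made concrete with a small sketch. The edge list, node names, and helper functions below are hypothetical, and treating links as undirected when counting path length is an assumption made for this example rather than something the abstract states.

```python
from collections import defaultdict, deque

def degrees(edges):
    """Return (in_degree, out_degree) dicts for directed TF -> gene links."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for tf, gene in edges:
        out_deg[tf] += 1   # links emanating from a TF
        in_deg[gene] += 1  # links pointing to a regulated gene
    return dict(in_deg), dict(out_deg)

def distances_from(source, edges):
    """Breadth-first search: number of links from `source` to every reachable node."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)  # assumption: links are traversed in either direction
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# Hypothetical toy network with two TFs and three regulated genes.
edges = [("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneB"), ("TF2", "geneC")]
in_deg, out_deg = degrees(edges)
dist = distances_from("geneA", edges)
```

In this toy network geneB has in-degree 2 because both TFs regulate it, and geneC is four links from geneA along the path geneA, TF1, geneB, TF2, geneC.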
Affiliation(s)
- David P Tuck
- Department of Pathology, Yale University School of Medicine, New Haven, Connecticut 06510, USA
- Harriet M Kluger
- Department of Internal Medicine, Yale University School of Medicine, New Haven, Connecticut 06510, USA
- Yuval Kluger
- Department of Cell Biology, New York University School of Medicine, New York, New York 10016, USA
|