151
|
Hu M, Qin ZS. Query large scale microarray compendium datasets using a model-based Bayesian approach with variable selection. PLoS One 2009; 4:e4495. [PMID: 19214232 PMCID: PMC2637418 DOI: 10.1371/journal.pone.0004495] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/01/2008] [Accepted: 12/06/2008] [Indexed: 11/19/2022] Open
Abstract
In microarray gene expression data analysis, it is often of interest to identify genes that share similar expression profiles with a particular gene such as a key regulatory protein. Multiple studies have been conducted using various correlation measures to identify co-expressed genes. While working well for small datasets, the heterogeneity introduced from increased sample size inevitably reduces the sensitivity and specificity of these approaches. This is because most co-expression relationships do not extend to all experimental conditions. With the rapid increase in the size of microarray datasets, identifying functionally related genes from large and diverse microarray gene expression datasets is a key challenge. We develop a model-based gene expression query algorithm built under the Bayesian model selection framework. It is capable of detecting co-expression profiles under a subset of samples/experimental conditions. In addition, it allows linearly transformed expression patterns to be recognized and is robust against sporadic outliers in the data. Both features are critically important for increasing the power of identifying co-expressed genes in large scale gene expression datasets. Our simulation studies suggest that this method outperforms existing correlation coefficients or mutual information-based query tools. When we apply this new method to the Escherichia coli microarray compendium data, it identifies a majority of known regulons as well as novel potential target genes of numerous key transcription factors.
Affiliation(s)
- Ming Hu
- Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, United States of America
- Zhaohui S. Qin
- Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, United States of America
|
152
|
A new procedure to optimize the selection of groups in a classification tree: Applications for ecological data. Ecol Modell 2009. [DOI: 10.1016/j.ecolmodel.2008.11.006] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
|
153
|
Ying L, Sarwal M. In praise of arrays. Pediatr Nephrol 2009; 24:1643-59; quiz 1655, 1659. [PMID: 18568367 PMCID: PMC2719727 DOI: 10.1007/s00467-008-0808-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 10/12/2007] [Revised: 02/26/2008] [Accepted: 02/27/2008] [Indexed: 11/29/2022]
Abstract
Microarray technologies have both fascinated and frustrated the transplant community since their introduction roughly a decade ago. Fascination arose from the possibility offered by the technology to gain a profound insight into the cellular response to immunogenic injury and the potential that this genomic signature would be indicative of the biological mechanism by which that stress was induced. Frustrations have arisen primarily from technical factors such as data variance, the requirement for the application of advanced statistical and mathematical analyses, and difficulties associated with actually recognizing signature gene-expression patterns and discerning mechanisms. To aid the understanding of this powerful tool, its versatility, and how it is dramatically changing the molecular approach to biomedical and clinical research, this teaching review describes the technology and its applications, as well as the limitations and evolution of microarrays, in the field of organ transplantation. Finally, it calls upon the attention of the transplant community to integrate into multidisciplinary teams, to take advantage of this technology and its expanding applications in unraveling the complex injury circuits that currently limit transplant survival.
Affiliation(s)
- Lihua Ying
- Department of Pediatrics, Stanford University, G320, 300 Pasteur Drive, Stanford, CA 94305 USA
- Minnie Sarwal
- Department of Pediatrics, Stanford University, G320, 300 Pasteur Drive, Stanford, CA 94305 USA
|
154
|
Erten C, Sözdinler M. Biclustering Expression Data Based on Expanding Localized Substructures. BIOINFORMATICS AND COMPUTATIONAL BIOLOGY 2009. [DOI: 10.1007/978-3-642-00727-9_22] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/15/2023]
|
155
|
DiMaggio PA, McAllister SR, Floudas CA, Feng XJ, Rabinowitz JD, Rabitz HA. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 2008; 9:458. [PMID: 18954459 PMCID: PMC2605474 DOI: 10.1186/1471-2105-9-458] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Received: 09/19/2008] [Accepted: 10/27/2008] [Indexed: 11/29/2022] Open
Abstract
Background The analysis of large-scale data sets via clustering techniques is utilized in a number of applications. Biclustering in particular has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Biclustering algorithms also have important applications in sample classification where, for instance, tissue samples can be classified as cancerous or normal. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the "best" grouping of elements according to some metric and cluster definition and thus result in suboptimal clusters. Results In this article, we present a rigorous approach to biclustering, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. Cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices to generate biclusters. The performance of OREO is tested on (a) metabolite concentration data, (b) an image reconstruction matrix, (c) synthetic data with implanted biclusters, and gene expression data for (d) colon cancer data, (e) breast cancer data, as well as (f) yeast segregant data to validate the ability of the proposed method and compare it to existing biclustering and clustering methods. 
Conclusion We demonstrate that this rigorous global optimization method for biclustering produces clusters with more insightful groupings of similar entities, such as genes or metabolites sharing common functions, than other clustering and biclustering algorithms and can reconstruct underlying fundamental patterns in the data for several distinct sets of data matrices arising in important biological applications.
Affiliation(s)
- Peter A DiMaggio
- Department of Chemical Engineering, Princeton University, Princeton, NJ, USA.
|
156
|
Christinat Y, Wachmann B, Zhang L. Gene expression data analysis using a novel approach to biclustering combining discrete and continuous data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:583-593. [PMID: 18989045 DOI: 10.1109/tcbb.2007.70251] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 05/27/2023]
Abstract
Many different methods exist for pattern detection in gene expression data. In contrast to classical methods, biclustering has the ability to cluster a group of genes together with a group of conditions (replicates, set of patients or drug compounds). However, since the problem is NP-complex, most algorithms use heuristic search functions and therefore might converge towards local maxima. By using the results of biclustering on discrete data as a starting point for a local search function on continuous data, our algorithm avoids the problem of heuristic initialization. Similar to OPSM, our algorithm aims to detect biclusters whose rows and columns can be ordered such that row values are growing across the bicluster's columns and vice-versa. Results have been generated on the yeast genome (Saccharomyces cerevisiae), a human cancer dataset and random data. Results on the yeast genome showed that 89% of the one hundred biggest non-overlapping biclusters were enriched with Gene Ontology annotations. A comparison with OPSM and ISA demonstrated a better efficiency when using gene and condition orders. We present results on random and real datasets that show the ability of our algorithm to capture statistically significant and biologically relevant biclusters.
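The order-preserving idea mentioned in this abstract can be illustrated with a small check. This is only a sketch of the OPSM-style criterion (a column order under which every row is strictly increasing), not the authors' algorithm; the function name `order_preserving_columns` and the toy matrix are assumptions made for illustration.

```python
def order_preserving_columns(rows):
    """Return a column order under which every row is strictly increasing,
    or None if no such order exists.

    If such an order exists it must equal the sort order of the first row,
    so checking that single candidate is sufficient.
    """
    if not rows:
        return None
    first = rows[0]
    # Candidate order: columns sorted by their value in the first row.
    order = sorted(range(len(first)), key=lambda j: first[j])
    for row in rows:
        values = [row[j] for j in order]
        # Every row must be strictly increasing under the candidate order.
        if any(a >= b for a, b in zip(values, values[1:])):
            return None
    return order


# Hypothetical toy bicluster (genes in rows, conditions in columns).
bicluster = [
    [0.2, 1.5, 0.9],
    [1.0, 4.0, 2.5],
    [3.1, 9.0, 3.2],
]
print(order_preserving_columns(bicluster))
```

A search procedure such as the one the abstract describes would look for large row and column subsets passing a test of this kind; the check itself says nothing about how those subsets are found.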
Affiliation(s)
- Yann Christinat
- Laboratory for Computational Biology and Bioinformatics, School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne, Station 14, CH-1015 Lausanne, Switzerland.
|
157
|
Madi A, Friedman Y, Roth D, Regev T, Bransburg-Zabary S, Jacob EB. Genome holography: deciphering function-form motifs from gene expression data. PLoS One 2008; 3:e2708. [PMID: 18628959 PMCID: PMC2444029 DOI: 10.1371/journal.pone.0002708] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 03/06/2008] [Accepted: 06/19/2008] [Indexed: 12/28/2022] Open
Abstract
Background DNA chips allow simultaneous measurements of genome-wide response of thousands of genes, i.e. system level monitoring of the gene-network activity. Advanced analysis methods have been developed to extract meaningful information from the vast amount of raw gene-expression data obtained from the microarray measurements. These methods usually aimed to distinguish between groups of subjects (e.g., cancer patients vs. healthy subjects) or identifying marker genes that help to distinguish between those groups. We assumed that motifs related to the internal structure of operons and gene-networks regulation are also embedded in microarray and can be deciphered by using proper analysis. Methodology/Principal Findings The analysis presented here is based on investigating the gene-gene correlations. We analyze a database of gene expression of Bacillus subtilis exposed to sub-lethal levels of 37 different antibiotics. Using unsupervised analysis (dendrogram) of the matrix of normalized gene-gene correlations, we identified the operons as they form distinct clusters of genes in the sorted correlation matrix. Applying dimension-reduction algorithm (Principal Component Analysis, PCA) to the matrices of normalized correlations reveals functional motifs. The genes are placed in a reduced 3-dimensional space of the three leading PCA eigen-vectors according to their corresponding eigen-values. We found that the organization of the genes in the reduced PCA space recovers motifs of the operon internal structure, such as the order of the genes along the genome, gene separation by non-coding segments, and translational start and end regions. In addition to the intra-operon structure, it is also possible to predict inter-operon relationships, operons sharing functional regulation factors, and more. In particular, we demonstrate the above in the context of the competence and sporulation pathways. 
Conclusions/Significance We demonstrated that by analyzing gene-gene correlation from gene-expression data it is possible to identify operons and to predict unknown internal structure of operons and gene-networks regulation.
Affiliation(s)
- Asaf Madi
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Yonatan Friedman
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Computational and Systems Biology, Massachusetts Institute of Technology (MIT), Boston, Massachusetts, United States of America
- Dalit Roth
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Tamar Regev
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Sharron Bransburg-Zabary
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Eshel Ben Jacob
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- The Center for Theoretical and Biological Physics, University of California San Diego, La Jolla, California, United States of America
|
158
|
Murat A, Migliavacca E, Gorlia T, Lambiv WL, Shay T, Hamou MF, de Tribolet N, Regli L, Wick W, Kouwenhoven MCM, Hainfellner JA, Heppner FL, Dietrich PY, Zimmer Y, Cairncross JG, Janzer RC, Domany E, Delorenzi M, Stupp R, Hegi ME. Stem cell-related "self-renewal" signature and high epidermal growth factor receptor expression associated with resistance to concomitant chemoradiotherapy in glioblastoma. J Clin Oncol 2008; 26:3015-24. [PMID: 18565887 DOI: 10.1200/jco.2007.15.7164] [Citation(s) in RCA: 548] [Impact Index Per Article: 34.3] [Indexed: 12/23/2022] Open
Abstract
PURPOSE Glioblastomas are notorious for resistance to therapy, which has been attributed to DNA-repair proficiency, a multitude of deregulated molecular pathways, and, more recently, to the particular biologic behavior of tumor stem-like cells. Here, we aimed to identify molecular profiles specific for treatment resistance to the current standard of care of concomitant chemoradiotherapy with the alkylating agent temozolomide. PATIENTS AND METHODS Gene expression profiles of 80 glioblastomas were interrogated for associations with resistance to therapy. Patients were treated within clinical trials testing the addition of concomitant and adjuvant temozolomide to radiotherapy. RESULTS An expression signature dominated by HOX genes, which comprises Prominin-1 (CD133), emerged as a predictor for poor survival in patients treated with concomitant chemoradiotherapy (n = 42; hazard ratio = 2.69; 95% CI, 1.38 to 5.26; P = .004). This association could be validated in an independent data set. Provocatively, the HOX cluster was reminiscent of a "self-renewal" signature (P = .008; Gene Set Enrichment Analysis) recently characterized in a mouse leukemia model. The HOX signature and EGFR expression were independent prognostic factors in multivariate analysis, adjusted for the O-6-methylguanine-DNA methyltransferase (MGMT) methylation status, a known predictive factor for benefit from temozolomide, and age. Better outcome was associated with gene clusters characterizing features of tumor-host interaction including tumor vascularization and cell adhesion, and innate immune response. CONCLUSION This study provides first clinical evidence for the implication of a "glioma stem cell" or "self-renewal" phenotype in treatment resistance of glioblastoma. Biologic mechanisms identified here to be relevant for resistance will guide future targeted therapies and respective marker development for individualized treatment and patient selection.
Affiliation(s)
- Anastasia Murat
- Laboratory of Tumor Biology and Genetics, Centre Universitaire Romand de Neurochirurgie, Centre Hospitalier Universitaire Vaudois and University of Lausanne, Lausanne 1011, Switzerland
|
159
|
Cho H, Dhillon IS. Coclustering of human cancer microarrays using Minimum Sum-Squared Residue coclustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:385-400. [PMID: 18670042 DOI: 10.1109/tcbb.2007.70268] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 05/26/2023]
Abstract
It is a consensus in microarray analysis that identifying potential local patterns, characterized by coherent groups of genes and conditions, may shed light on the discovery of previously undetectable biological cellular processes of genes as well as macroscopic phenotypes of related samples. In order to simultaneously cluster genes and conditions, we have previously developed a fast co-clustering algorithm, Minimum Sum-Squared Residue Co-clustering (MSSRCC), which employs an alternating minimization scheme and generates what we call co-clusters in a checkerboard structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and resolve the degeneracy problem in partitional clustering algorithms. The strategies include binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of various strategies on both synthetic gene expression datasets and real human cancer microarrays and provide empirical evidence that MSSRCC with the proposed strategies performs better than existing co-clustering and clustering algorithms. In particular, the combination of all the three strategies leads to the best performance. Furthermore, we illustrate coherence of the resulting co-clusters in a checkerboard structure, where genes in a co-cluster manifest the phenotype structure of corresponding specific samples, and evaluate the enrichment of functional annotations in Gene Ontology (GO).
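To make the "sum-squared residue" objective concrete, the sketch below computes it for a single co-cluster using one common residue definition, h_ij = a_ij - (row mean) - (column mean) + (overall mean). The abstract does not state which residue definition MSSRCC uses, so this choice, the function name `sum_squared_residue`, and the toy blocks are assumptions for illustration only.

```python
def sum_squared_residue(block):
    """Sum of squared residues of one co-cluster (list of equal-length rows).

    Residue of entry (i, j): a_ij - row_mean_i - col_mean_j + overall_mean.
    A block that follows a perfect additive pattern has residue zero.
    """
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(block[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            h = block[i][j] - row_means[i] - col_means[j] + overall
            total += h * h
    return total


# Hypothetical blocks: rows differ by a constant shift vs. a crossed pattern.
additive_block = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
crossed_block = [[1, 2], [2, 1]]
print(sum_squared_residue(additive_block))
print(sum_squared_residue(crossed_block))
```

An alternating minimization scheme of the kind the abstract mentions would reassign rows and columns to co-clusters so that the total of this quantity over all co-clusters decreases.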
Affiliation(s)
- Hyuk Cho
- Department of Computer Science, The University of Texas at Austin, 1 University Station C0500, Austin, TX 78712, USA.
|
160
|
Tabakman R, Jiang H, Shahar I, Arien-Zakay H, Levine R, Lazarovici P. Neuroprotection by NGF in the PC12 In Vitro OGD Model. Ann N Y Acad Sci 2008. [DOI: 10.1111/j.1749-6632.2005.tb00013.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/19/2023]
|
161
|
Mejía-Roa E, Carmona-Saez P, Nogales R, Vicente C, Vázquez M, Yang XY, García C, Tirado F, Pascual-Montano A. bioNMF: a web-based tool for nonnegative matrix factorization in biology. Nucleic Acids Res 2008; 36:W523-8. [PMID: 18515346 PMCID: PMC2447803 DOI: 10.1093/nar/gkn335] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
In the last few years, advances in high-throughput technologies have generated large amounts of biological data that require analysis and interpretation. Nonnegative matrix factorization (NMF) has been established as a very effective method to reveal information about the complex latent relationships in experimental data sets. Using this method as part of the exploratory data analysis workflow would certainly help in the process of interpreting and understanding the complex biological mechanisms that underlie experimental data. We have developed bioNMF, a web-based tool that implements the NMF methodology in different analysis contexts to support some of the most important reported applications in biology. This online tool provides a user-friendly interface, combined with a computationally efficient parallel implementation of the NMF methods to explore the data in different analysis scenarios. In addition to the online access, bioNMF also provides the same functionality included in the website as a public web services interface, enabling users with more computer expertise to launch jobs into the bioNMF server from their own scripts and workflows. The bioNMF application is freely available at http://bionmf.dacya.ucm.es.
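For readers unfamiliar with NMF, the following is a minimal pure-Python sketch of the classic multiplicative-update scheme for the squared-error objective, factorizing a nonnegative matrix V into W times H. It is not bioNMF's implementation, which the abstract does not describe; the helper names, the small `eps` guard, and the toy matrix are assumptions for illustration.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize nonnegative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization keeps all updates well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical 4 x 3 nonnegative matrix with an exact rank-2 structure.
v = [[1, 0, 2], [0, 1, 1], [2, 1, 5], [1, 1, 3]]
w, h = nmf(v, rank=2)
print(frobenius_error(v, w, h))
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout, which is what gives NMF factors their parts-based interpretation.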
Affiliation(s)
- E Mejía-Roa
- Computer Architecture Department, Complutense University, Madrid, Spain
|
162
|
Varshavsky R, Horn D, Linial M. Global considerations in hierarchical clustering reveal meaningful patterns in data. PLoS One 2008; 3:e2247. [PMID: 18493326 PMCID: PMC2375056 DOI: 10.1371/journal.pone.0002247] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 10/16/2007] [Accepted: 03/31/2008] [Indexed: 11/18/2022] Open
Abstract
Background A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied. Methodology/Principal Findings We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion-channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. Conclusions Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can leverage the quality of manually created mapping of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.
Affiliation(s)
- Roy Varshavsky
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
|
163
|
In vivo genome-wide expression study on human circulating B cells suggests a novel ESR1 and MAPK3 network for postmenopausal osteoporosis. J Bone Miner Res 2008; 23:644-54. [PMID: 18433299 PMCID: PMC2674539 DOI: 10.1359/jbmr.080105] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.6] [Indexed: 11/18/2022]
Abstract
INTRODUCTION Osteoporosis is characterized by low BMD. Studies have shown that B cells may participate in osteoclastogenesis through expression of osteoclast-related factors, such as RANKL, transforming growth factor beta (TGFB), and osteoprotegerin (OPG). However, the in vivo significance of B cells in human bone metabolism and osteoporosis is still largely unknown, particularly at the systematic gene expression level. MATERIALS AND METHODS In this study, Affymetrix HG-U133A GeneChip arrays were used to identify genes differentially expressed in B cells between 10 low and 10 high BMD postmenopausal women. Significance of differential expression was tested by t-test and adjusted for multiple testing with the Benjamini and Hochberg (BH) procedure (adjusted p ≤ 0.05). RESULTS Twenty-nine genes were downregulated in the low versus high BMD group. These genes were further analyzed using Ingenuity Pathways Analysis (Ingenuity Systems). A network involving estrogen receptor 1 (ESR1) and mitogen activated protein kinase 3 (MAPK3) was identified. Real-time RT-PCR confirmed differential expression of eight genes, including ESR1, MAPK3, methyl CpG binding protein 2 (MECP2), proline-serine-threonine phosphatase interacting protein 1 (PSTPIP1), Src-like-adaptor (SLA), serine/threonine kinase 11 (STK11), WNK lysine-deficient protein kinase 1 (WNK1), and zinc finger protein 446 (ZNF446). CONCLUSIONS This is the first in vivo genome-wide expression study on human B cells in relation to osteoporosis. Our results highlight the significance of B cells in the etiology of osteoporosis and suggest a novel mechanism for postmenopausal osteoporosis (i.e., that downregulation of ESR1 and MAPK3 in B cells regulates secretion of factors, leading to increased osteoclastogenesis or decreased osteoblastogenesis).
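The Benjamini and Hochberg adjustment named in the methods can be sketched in a few lines. The function name `benjamini_hochberg` and the example p-values are assumptions for illustration; the p-values are invented and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input.

    For the p-value of rank r (1 = smallest) out of m tests, the raw
    adjustment is p * m / r; a running minimum taken from the largest rank
    downward keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a two-group t-test.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted, significant)
```

Genes whose adjusted value is at most 0.05 would be called differentially expressed at a 5% false discovery rate, which corresponds to the threshold quoted in the abstract.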
|
164
|
A Simple Model of the Modular Structure of Transcriptional Regulation in Yeast. J Comput Biol 2008; 15:393-405. [DOI: 10.1089/cmb.2008.0020] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022] Open
|
165
|
Gan X, Liew AWC, Yan H. Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinformatics 2008; 9:209. [PMID: 18433477 PMCID: PMC2386490 DOI: 10.1186/1471-2105-9-209] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Received: 07/27/2007] [Accepted: 04/23/2008] [Indexed: 11/10/2022] Open
Abstract
Background In DNA microarray experiments, discovering groups of genes that share similar transcriptional characteristics is instrumental in functional annotation, tissue classification and motif identification. However, in many situations a subset of genes only exhibits consistent pattern over a subset of conditions. Conventional clustering algorithms that deal with the entire row or column in an expression matrix would therefore fail to detect these useful patterns in the data. Recently, biclustering has been proposed to detect a subset of genes exhibiting consistent pattern over a subset of conditions. However, most existing biclustering algorithms are based on searching for sub-matrices within a data matrix by optimizing certain heuristically defined merit functions. Moreover, most of these algorithms can only detect a restricted set of bicluster patterns. Results In this paper, we present a novel geometric perspective for the biclustering problem. The biclustering process is interpreted as the detection of linear geometries in a high dimensional data space. Such a new perspective views biclusters with different patterns as hyperplanes in a high dimensional space, and allows us to handle different types of linear patterns simultaneously by matching a specific set of linear geometries. This geometric viewpoint also inspires us to propose a generic bicluster pattern, i.e. the linear coherent model that unifies the seemingly incompatible additive and multiplicative bicluster models. As a particular realization of our framework, we have implemented a Hough transform-based hyperplane detection algorithm. The experimental results on human lymphoma gene expression dataset show that our algorithm can find biologically significant subsets of genes. Conclusion We have proposed a novel geometric interpretation of the biclustering problem. 
We have shown that many common types of bicluster are just different spatial arrangements of hyperplanes in a high dimensional data space. An implementation of the geometric framework using the Fast Hough transform for hyperplane detection can be used to discover biologically significant subsets of genes under subsets of conditions for microarray data analysis.
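One way to read the linear coherent model is that every row of a bicluster is a linear function of a reference row, row_i = a_i * reference + b_i, so the additive model is the special case a_i = 1 and the multiplicative model is the special case b_i = 0. The sketch below tests that property by least squares. It is an interpretation for illustration, not the Hough transform-based detector described in the abstract, and the function names and toy rows are assumptions.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # Constant reference row: only a constant row can match it.
        return 0.0, mean_y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_linear_coherent(rows, tol=1e-9):
    """True if every row equals a * rows[0] + b for some row-specific a, b."""
    reference = rows[0]
    for row in rows[1:]:
        slope, intercept = fit_line(reference, row)
        if any(abs(slope * r + intercept - v) > tol
               for r, v in zip(reference, row)):
            return False
    return True


# Hypothetical rows built from one reference profile.
reference = [1.0, 2.0, 4.0]
additive = [3.0, 4.0, 6.0]          # reference + 2
multiplicative = [2.0, 4.0, 8.0]    # 2 * reference
general = [2.5, 4.5, 8.5]           # 2 * reference + 0.5
unrelated = [1.0, 5.0, 2.0]
print(is_linear_coherent([reference, additive, multiplicative, general]))
print(is_linear_coherent([reference, unrelated]))
```

On real expression data an exact tolerance would be replaced by a noise-aware threshold, and the rows and columns to test would themselves have to be discovered, which is the part the hyperplane detection addresses.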
Affiliation(s)
- Xiangchao Gan
- Department of Computer Science, King's College London, UK.
|
166
|
Jagalur M, Pal C, Learned-Miller E, Zoeller RT, Kulp D. Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinformatics 2008; 8 Suppl 10:S5. [PMID: 18269699 PMCID: PMC2230506 DOI: 10.1186/1471-2105-8-s10-s5] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Indexed: 12/04/2022] Open
Abstract
Background Many important high throughput projects use in situ hybridization and may require the analysis of images of spatial cross sections of organisms taken with cellular level resolution. Projects creating gene expression atlases at unprecedented scales for the embryonic fruit fly as well as the embryonic and adult mouse already involve the analysis of hundreds of thousands of high resolution experimental images mapping mRNA expression patterns. Challenges include accurate registration of highly deformed tissues, associating cells with known anatomical regions, and identifying groups of genes whose expression is coordinately regulated with respect to both concentration and spatial location. Solutions to these and other challenges will lead to a richer understanding of the complex system aspects of gene regulation in heterogeneous tissue. Results We present an end-to-end approach for processing raw in situ expression imagery and performing subsequent analysis. We use a non-linear, information theoretic based image registration technique specifically adapted for mapping expression images to anatomical annotations and a method for extracting expression information within an anatomical region. Our method consists of coarse registration, fine registration, and expression feature extraction steps. From this we obtain a matrix for expression characteristics with rows corresponding to genes and columns corresponding to anatomical sub-structures. We perform matrix block cluster analysis using a novel row-column mixture model and we relate clustered patterns to Gene Ontology (GO) annotations. Conclusion Resulting registrations suggest that our method is robust over intensity levels and shape variations in ISH imagery. 
Functional enrichment studies from both simple analysis and block clustering indicate that gene relationships consistent with biological knowledge of neuronal gene functions can be extracted from large ISH image databases such as the Allen Brain Atlas [1] and the Max-Planck Institute [2] using our method. While we focus here on imagery and experiments of the mouse brain our approach should be applicable to a variety of in situ experiments.
Affiliation(s)
- Manjunatha Jagalur
- Department of Computer Science, University of Massachusetts Amherst, Amherst, MA-01003, USA.
|
167
|
Maere S, Van Dijck P, Kuiper M. Extracting expression modules from perturbational gene expression compendia. BMC Syst Biol 2008; 2:33. [PMID: 18402676 PMCID: PMC2386865 DOI: 10.1186/1752-0509-2-33] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 09/18/2007] [Accepted: 04/10/2008] [Indexed: 12/14/2022]
Abstract
Background Compendia of gene expression profiles under chemical and genetic perturbations constitute an invaluable resource from a systems biology perspective. However, the perturbational nature of such data imposes specific challenges on the computational methods used to analyze them. In particular, traditional clustering algorithms have difficulties in handling one of the prominent features of perturbational compendia, namely partial coexpression relationships between genes. Biclustering methods on the other hand are specifically designed to capture such partial coexpression patterns, but they show a variety of other drawbacks. For instance, some biclustering methods are less suited to identify overlapping biclusters, while others generate highly redundant biclusters. Also, none of the existing biclustering tools takes advantage of the staple of perturbational expression data analysis: the identification of differentially expressed genes. Results We introduce a novel method, called ENIGMA, that addresses some of these issues. ENIGMA leverages differential expression analysis results to extract expression modules from perturbational gene expression data. The core parameters of the ENIGMA clustering procedure are automatically optimized to reduce the redundancy between modules. In contrast to the biclusters produced by most other methods, ENIGMA modules may show internal substructure, i.e. subsets of genes with distinct but significantly related expression patterns. The grouping of these (often functionally) related patterns in one module greatly aids in the biological interpretation of the data. We show that ENIGMA outperforms other methods on artificial datasets, using a quality criterion that, unlike other criteria, can be used for algorithms that generate overlapping clusters and that can be modified to take redundancy between clusters into account. 
Finally, we apply ENIGMA to the Rosetta compendium of expression profiles for Saccharomyces cerevisiae and we analyze one pheromone response-related module in more detail, demonstrating the potential of ENIGMA to generate detailed predictions. Conclusion It is increasingly recognized that perturbational expression compendia are essential to identify the gene networks underlying cellular function, and efforts to build these for different organisms are currently underway. We show that ENIGMA constitutes a valuable addition to the repertoire of methods to analyze such data.
Affiliation(s)
- Steven Maere
- Department of Plant Systems Biology, VIB, Technologiepark 927, B-9052 Ghent, Belgium.
|
168
|
Stanberry L, Murua A, Cordes D. Functional connectivity mapping using the ferromagnetic Potts spin model. Hum Brain Mapp 2008; 29:422-40. [PMID: 17497627 PMCID: PMC6871052 DOI: 10.1002/hbm.20397] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 01/11/2006] [Revised: 01/25/2007] [Accepted: 02/26/2007] [Indexed: 11/12/2022] Open
Abstract
An unsupervised stochastic clustering method based on the ferromagnetic Potts spin model is introduced as a powerful tool to determine functionally connected regions. The method provides an intuitively simple approach to clustering and makes no assumptions of the number of clusters in the data or their underlying distribution. The performance of the method and its dependence on the intrinsic parameters (size of the neighborhood, form of the interaction term, etc.) is investigated on the simulated data and real fMRI data acquired during a conventional periodic finger tapping task. The merits of incorporating Euclidean information into the connectivity analysis are discussed. The ability of the Potts model clustering to uncover the hidden structure in the complex data is demonstrated through its application to the resting-state data to determine functional connectivity networks of the anterior and posterior cingulate cortices for the group of nine healthy male subjects.
Affiliation(s)
- Larissa Stanberry
- Department of Statistics, University of Washington, Seattle, Washington 98195-4322, USA.
|
169
|
Kim C, Cheon M, Kang M, Chang I. A simple and exact Laplacian clustering of complex networking phenomena: application to gene expression profiles. Proc Natl Acad Sci U S A 2008; 105:4083-7. [PMID: 18337496 PMCID: PMC2393820 DOI: 10.1073/pnas.0708598105] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 09/11/2007] [Indexed: 11/18/2022] Open
Abstract
Unraveling of the unified networking characteristics of complex networking phenomena is of great interest yet a formidable task. There is currently no simple strategy with a rigorous framework. Using an analogy to the exact algebraic property for a transition matrix of a master equation in statistical physics, we propose a method based on a Laplacian matrix for the discovery and prediction of new classes in the unsupervised complex networking phenomena where the class of each sample is completely unknown. Using this proposed Laplacian approach, we can simultaneously discover different classes and determine the identity of each class. Through an illustrative test of the Laplacian approach applied to real datasets of gene expression profiles, leukemia data [Golub TR, et al. (1999) Science 286:531-537], and lymphoma data [Alizadeh AA, et al. (2000) Nature 403:503-511], we demonstrate that this approach is accurate and robust with a mathematical and physical realization. It offers a general framework for characterizing any kind of complex networking phenomenon in broad areas irrespective of whether they are supervised or unsupervised.
Affiliation(s)
- Mookyung Cheon
- National Research Laboratory for Computational Proteomics and Biophysics, Department of Physics, and
- Minho Kang
- Interdisciplinary Research Program of Bioinformatics, Pusan National University, Busan 609-735, Korea
- Iksoo Chang
- National Research Laboratory for Computational Proteomics and Biophysics, Department of Physics, and
|
170
|
|
171
|
|
172
|
Capobianco E. Model validation for gene selection and regulation maps. Funct Integr Genomics 2007; 8:87-99. [PMID: 18064499 DOI: 10.1007/s10142-007-0066-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 09/05/2007] [Revised: 10/09/2007] [Accepted: 10/14/2007] [Indexed: 10/22/2022]
Abstract
Consider the problem of investigating the structure of a set of sample points in a very high-dimensional (Euclidean) space. This case is paradigmatic, for instance, in postgenomic applications. The high dimensionality and small sample size make statistical inference and optimization difficult problems, and anyone selecting a model or choosing a learning algorithm must contend with the fact that no consensus guidelines currently exist. Usually, a linear or nonlinear projection method is required to map the observations into a low-dimensional space with the most salient data features preserved. This step usually involves computing statistics from the low-dimensional projected space of features and then making inferences about the high-dimensional original structures (the genes). This work deals with model validation for gene selection and regulation dynamics. The analysis is conducted through a mix of quantitative methods and qualitative aspects. A regularized inference approach is employed based on dimensionality reduction, data denoising, and feature extraction tasks. Each task requires the implementation of statistics and machine learning algorithms. We focus on the complex problem of inferring the coregulation from the coexpression gene dynamics in the presence of limited biological information and time course perturbation experiments. In particular, both separation and interference gene dynamics are considered and validated to design the most coherent underlying transcriptional regulatory map.
Affiliation(s)
- Enrico Capobianco
- CRS4 Bioinformatics Laboratory, Technology Park of Sardinia, Pula, Cagliari, Sardinia, Italy.
|
173
|
Zhao H, Liew AWC, Xie X, Yan H. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. J Theor Biol 2007; 251:264-74. [PMID: 18199458 DOI: 10.1016/j.jtbi.2007.11.030] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Received: 07/23/2007] [Revised: 10/17/2007] [Accepted: 11/29/2007] [Indexed: 11/30/2022]
Abstract
Biclustering is an important tool in microarray analysis when only a subset of genes co-regulates in a subset of conditions. Different from standard clustering analyses, biclustering performs simultaneous classification in both gene and condition directions in a microarray data matrix. However, the biclustering problem is inherently intractable and computationally complex. In this paper, we present a new biclustering algorithm based on the geometrical viewpoint of coherent gene expression profiles. In this method, we perform pattern identification based on the Hough transform in a column-pair space. The algorithm is especially suitable for the biclustering analysis of large-scale microarray data. Our studies show that the approach can discover significant biclusters with respect to the increased noise level and regulatory complexity. Furthermore, we also test the ability of our method to locate biologically verifiable biclusters within an annotated set of genes.
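The column-pair idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it uses the classical slope-intercept form of the Hough transform, and the candidate slope grid, the intercept bin width, and the function name `hough_line_genes` are assumptions made for illustration only. Each gene contributes one point (its expression in two conditions) and votes for every line y = kx + b it could lie on; the fullest accumulator bin then gives a set of genes that are linearly coherent across that pair of columns.

```python
from collections import defaultdict

def hough_line_genes(points, slopes, intercept_res=0.5):
    """Return (slope, intercept, gene_indices) for the most-voted line.

    points: list of (x, y) expression values of each gene in two conditions.
    slopes: hypothetical grid of candidate slopes k to scan.
    intercept_res: bin width used to discretize the intercept b = y - k*x.
    """
    accumulator = defaultdict(list)  # (slope, intercept bin) -> gene indices
    for gene, (x, y) in enumerate(points):
        for k in slopes:
            b_bin = round((y - k * x) / intercept_res)
            accumulator[(k, b_bin)].append(gene)
    # The bin with the most votes is the best-supported line.
    (k, b_bin), genes = max(accumulator.items(), key=lambda kv: len(kv[1]))
    return k, b_bin * intercept_res, genes

# Five genes lie on y = 2x + 1; three others are scattered.
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (0, 5), (3, 0), (1, 8)]
slope, intercept, genes = hough_line_genes(points, [-2, -1, -0.5, 0.5, 1, 2])
```

In this toy input the winning bin collects exactly the five genes on the planted line, while the scattered genes never gather enough votes in any single bin. A full biclustering method would still have to combine such per-pair results across many column pairs, which this sketch does not attempt.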
Affiliation(s)
- Hongya Zhao
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
|
174
|
Wu Z, Irizarry RA. A statistical framework for the analysis of microarray probe-level data. Ann Appl Stat 2007. [DOI: 10.1214/07-aoas116] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
|
175
|
Saad R, Halgamuge SK, Li J. Polynomial kernel adaptation and extensions to the SVM classifier learning. Neural Comput Appl 2007. [DOI: 10.1007/s00521-006-0078-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/23/2022]
|
176
|
Sveshnikova AN, Ivanov PS. Biotechnology. Gene expression and microchips: Problems of the quantitative analysis. Russ J Gen Chem 2007. [DOI: 10.1134/s1070363207110369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
177
|
Katzenellenbogen M, Mizrahi L, Pappo O, Klopstock N, Olam D, Jacob-Hirsch J, Amariglio N, Rechavi G, Domany E, Galun E, Goldenberg D. Molecular mechanisms of liver carcinogenesis in the mdr2-knockout mice. Mol Cancer Res 2007; 5:1159-70. [PMID: 18025261 DOI: 10.1158/1541-7786.mcr-07-0172] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.9] [Indexed: 11/16/2022]
Abstract
Mouse models of hepatocellular carcinoma (HCC) simulate specific subgroups of human HCC. We investigated hepatocarcinogenesis in Mdr2-knockout (Mdr2-KO) mice, a model of inflammation-associated HCC, using gene expression profiling and immunohistochemical analyses. Gene expression profiling showed that although Mdr2-KO mice differ from other published murine HCC models, they share several important deregulated pathways and many coordinately differentially expressed genes with human HCC data sets. Analysis of genome positions of differentially expressed genes in liver tumors revealed a prolonged region of down-regulated genes on murine chromosome 8 in three of the six analyzed tumor samples. This region is syntenic to human chromosomal regions that are frequently deleted in human HCC and harbor multiple tumor suppressor genes. Real-time reverse transcription-PCR analysis of 16 tumor samples confirmed down-regulation of several tumor suppressors in most tumors. We show that in the aged Mdr2-KO mice, cyclin D1 nuclear level is increased in dysplastic hepatocytes that do not form nodules; however, it is decreased in most dysplastic nodules and in liver tumors. We found that this decrease is mostly at the protein, rather than the mRNA, level. These findings raise the question on the role of cyclin D1 at early stages of hepatocarcinogenesis in the Mdr2-KO HCC model. Furthermore, we show that most liver tumors in Mdr2-KO mice were characterized by the absence of beta-catenin activation. In conclusion, the Mdr2-KO mouse may serve as a model for beta-catenin-negative subgroup of human HCCs characterized by low nuclear cyclin D1 levels in tumor cells and by down-regulation of multiple tumor suppressor genes.
Affiliation(s)
- Mark Katzenellenbogen
- Goldyne Savad Institute of Gene Therapy, Hadassah-Hebrew University Medical Center, Kiryat Hadassah, P.O. Box 12000, Jerusalem 91120, Israel
|
178
|
Chen J, Dima RI, Thirumalai D. Allosteric Communication in Dihydrofolate Reductase: Signaling Network and Pathways for Closed to Occluded Transition and Back. J Mol Biol 2007; 374:250-66. [DOI: 10.1016/j.jmb.2007.08.047] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.2] [Received: 05/17/2007] [Revised: 08/01/2007] [Accepted: 08/08/2007] [Indexed: 10/22/2022]
|
179
|
Abstract
MOTIVATION It is an important and difficult task to extract gene network information from high-throughput genomic data. A common approach is to cluster genes using pairwise correlation as a distance metric. However, pairwise correlation is clearly too simplistic to describe the complex relationships among real genes since co-expression relationships are often restricted to a specific set of biological conditions/processes. In this study, we described a three-way gene interaction model that captures the dynamic nature of co-expression relationship between a gene pair through the introduction of a controller gene. RESULTS We surveyed 0.4 billion possible three-way interactions among 1000 genes in a microarray dataset containing 678 human cancer samples. To test the reproducibility and statistical significance of our results, we randomly split the samples into a training set and a testing set. We found that the gene triplets with the strongest interactions (i.e. with the smallest P-values from appropriate statistical tests) in the training set also had the strongest interactions in the testing set. A distinctive pattern of three-way interaction emerged from these gene triplets: depending on the third gene being expressed or not, the remaining two genes can be either co-expressed or mutually exclusive (i.e. expression of either one of them would repress the other). Such three-way interactions can exist without apparent pairwise correlations. The identified three-way interactions may constitute candidates for further experimentation using techniques such as RNA interference, so that novel gene network or pathways could be identified.
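The controller-gene pattern described in this abstract can be illustrated with a small Python sketch. It is not the statistical test used in the study, which the abstract does not specify: it simply splits the samples by whether a controller gene is above a hypothetical expression threshold and compares the Pearson correlation of the other two genes within each group. All names, the threshold, and the toy data are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def conditional_correlations(gene_a, gene_b, controller, threshold):
    """Correlation of gene_a and gene_b when the controller is on vs. off."""
    on = [i for i, c in enumerate(controller) if c > threshold]
    off = [i for i, c in enumerate(controller) if c <= threshold]
    r_on = pearson([gene_a[i] for i in on], [gene_b[i] for i in on])
    r_off = pearson([gene_a[i] for i in off], [gene_b[i] for i in off])
    return r_on, r_off

# Toy data: the pair is co-expressed when the controller is on
# and anti-correlated when it is off.
gene_a = [1, 2, 3, 4, 1, 2, 3, 4]
gene_b = [1, 2, 3, 4, 4, 3, 2, 1]
controller = [1, 1, 1, 1, 0, 0, 0, 0]
r_all = pearson(gene_a, gene_b)
r_on, r_off = conditional_correlations(gene_a, gene_b, controller, 0.5)
```

On this toy input the pooled correlation `r_all` is zero, while the two conditional correlations are +1 and -1. That mirrors the abstract's point that a three-way interaction can exist without any apparent pairwise correlation.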
Affiliation(s)
- Jiexin Zhang
- Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Unit 237, Houston, TX 77030-4009, USA
|
180
|
Zhang W, Li L, Li X, Jiang W, Huo J, Wang Y, Lin M, Rao S. Unravelling the hidden heterogeneities of diffuse large B-cell lymphoma based on coupled two-way clustering. BMC Genomics 2007; 8:332. [PMID: 17888167 PMCID: PMC2082044 DOI: 10.1186/1471-2164-8-332] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 09/28/2006] [Accepted: 09/22/2007] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND It becomes increasingly clear that our current taxonomy of clinical phenotypes is mixed with molecular heterogeneity. Of vital importance for refined clinical practice and improved intervention strategies is to define the hidden molecular distinct diseases using modern large-scale genomic approaches. Microarray omics technology has provided a powerful way to dissect hidden genetic heterogeneity of complex diseases. The aim of this study was thus to develop a bioinformatics approach to seek the transcriptional features leading to the hidden subtyping of a complex clinical phenotype. The basic strategy of the proposed method was to iteratively partition in two ways sample and feature space with super-paramagnetic clustering technique and to seek for hard and robust gene clusters that lead to a natural partition of disease samples and that have the highest functionally conceptual consensus evaluated with Gene Ontology. RESULTS We applied the proposed method to two publicly available microarray datasets of diffuse large B-cell lymphoma (DLBCL), a notoriously heterogeneous phenotype. A feature subset of 30 genes (38 probes) derived from analysis of the first dataset consisting of 4026 genes and 42 DLBCL samples identified three categories of patients with very different five-year overall survival rates (70.59%, 44.44% and 14.29% respectively; p = 0.0017). Analysis of the second dataset consisting of 7129 genes and 58 DLBCL samples revealed a feature subset of 13 genes (16 probes) that not only replicated the findings of the important DLBCL genes (e.g. JAW1 and BCL7A), but also identified three clinically similar subtypes (with 5-year overall survival rates of 63.13%, 34.92% and 15.38% respectively; p = 0.0009) to those identified in the first dataset. 
Finally, we built a multivariate Cox proportional-hazards prediction model for each feature subset and defined JAW1 as one of the most significant predictors (p = 0.005 and 0.014; hazard ratios = 0.02 and 0.03, respectively, for the two datasets) for both DLBCL cohorts under study. CONCLUSION Our results showed that the proposed algorithm is a promising computational strategy for peeling off the hidden genetic heterogeneity based on transcriptionally profiling disease samples, which may lead to an improved diagnosis and treatment of cancers.
Affiliation(s)
- Wei Zhang
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Li Li
- Institute of Medical Genetics, Tongji University, Shanghai 200092, China
- Xia Li
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Institute of Medical Genetics, Tongji University, Shanghai 200092, China
- Department of Computer Science, Harbin Institute of Technology, Harbin 150080, China
- The Biomedical Engineering Institute, Capital Medical University, Beijing 100054, China
- Wei Jiang
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Jianmin Huo
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Yadong Wang
- Department of Computer Science, Harbin Institute of Technology, Harbin 150080, China
- Meihua Lin
- The Biomedical Engineering Institute, Capital Medical University, Beijing 100054, China
- Department of Molecular Cardiology, Cleveland Clinic, Cleveland, OH 44195, USA
- Shaoqi Rao
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- The Biomedical Engineering Institute, Capital Medical University, Beijing 100054, China
- Department of Molecular Cardiology, Cleveland Clinic, Cleveland, OH 44195, USA
|
181
|
Shaik JS, Yeasin M. A unified framework for finding differentially expressed genes from microarray experiments. BMC Bioinformatics 2007; 8:347. [PMID: 17877806 PMCID: PMC2099446 DOI: 10.1186/1471-2105-8-347] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 03/29/2007] [Accepted: 09/18/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND This paper presents a unified framework for finding differentially expressed genes (DEGs) from the microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, namely, a) two-way clustering and b) combined adaptive ranking to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using the Fisher's omnibus criterion. The DEGs are selected using the FDR analysis. The third module performs three fold validations of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using false discovery rate analysis. In addition, the clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and the test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework. RESULTS The performance of the unified framework is compared with well-known ranking algorithms such as t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets each following two different distributions indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show the similar improvement in performance. First, a 3 fold validation process is provided for the two-sample cancer datasets. In addition, the analysis on 3 sets of Parkinson's data is performed to demonstrate the scalability of the proposed method to multi-sample microarray datasets. 
CONCLUSION This paper presents a unified framework for the robust selection of genes from the two-sample as well as multi-sample microarray experiments. Two different ranking methods used in module 1 bring diversity in the selection of genes. The conversion of ranks to p-values, the fusion of p-values and FDR analysis aid in the identification of significant genes which cannot be judged based on gene ranking alone. The 3 fold validation, namely, robustness in selection of genes using FDR analysis, clustering, and visualization demonstrate the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate the utility of the proposed approach on microarray datasets with two classes of samples. The scalability of the proposed unified approach to multi-sample (more than two sample classes) microarray datasets is addressed using three sets of Parkinson's Data. Empirical analyses show that the unified framework outperformed other gene selection methods in selecting differentially expressed genes from microarray data.
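The second module of this framework (fusing two sets of p-values with Fisher's omnibus criterion, then selecting genes by FDR analysis) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the rank-to-p-value R-test is omitted, the Benjamini-Hochberg step-up rule is assumed as the FDR procedure because the abstract does not name one, and the input p-values are made up.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    upper tail has the closed form exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    """
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)
    tail = sum(half_x ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_x) * tail

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/keep flag per p-value under the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Hypothetical p-values from two ranking methods for four genes.
p_method_1 = [0.001, 0.20, 0.04, 0.60]
p_method_2 = [0.002, 0.30, 0.03, 0.90]
fused = [fisher_combine(pair) for pair in zip(p_method_1, p_method_2)]
selected = benjamini_hochberg(fused, alpha=0.05)
```

Fisher's method assumes the fused p-values are independent. Two rankings computed from the same expression data are unlikely to satisfy that exactly, so the fused values here are best read as a combined score, not as calibrated probabilities.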
Affiliation(s)
- Jahangheer S Shaik
- Department of Electrical and Computer Engineering, CVPIA Lab, University of Memphis, Memphis, TN-38152, USA
- Mohammed Yeasin
- Department of Electrical and Computer Engineering, CVPIA Lab, University of Memphis, Memphis, TN-38152, USA
- Bioinformatics Program, CVPIA Lab, University of Memphis, Memphis, TN-38152, USA
- Biomedical Engineering, CVPIA Lab, University of Memphis, Memphis, TN-38152, USA
- Center for Advanced Robotics, CVPIA Lab, University of Memphis, Memphis, TN-38152, USA
- Software Testing and Excellence Program, University of Memphis, Memphis, TN-38152, USA
|
182
|
Oh MS, Raftery AE. Model-Based Clustering With Dissimilarities: A Bayesian Approach. J Comput Graph Stat 2007. [DOI: 10.1198/106186007x236127] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.8] [Indexed: 11/21/2022]
|
183
|
Trevino V, Falciani F, Barrera-Saldaña HA. DNA microarrays: a powerful genomic tool for biomedical and clinical research. Mol Med 2007; 13:527-41. [PMID: 17660860 PMCID: PMC1933257 DOI: 10.2119/2006-00107.trevino] [Citation(s) in RCA: 122] [Impact Index Per Article: 7.2] [Received: 12/06/2006] [Accepted: 07/02/2007] [Indexed: 12/11/2022] Open
Abstract
Among the many benefits of the Human Genome Project are new and powerful tools such as the genome-wide hybridization devices referred to as microarrays. Initially designed to measure gene transcriptional levels, microarray technologies are now used for comparing other genome features among individuals and their tissues and cells. Results provide valuable information on disease subcategories, disease prognosis, and treatment outcome. Likewise, they reveal differences in genetic makeup, regulatory mechanisms, and subtle variations and move us closer to the era of personalized medicine. To understand this powerful tool, its versatility, and how dramatically it is changing the molecular approach to biomedical and clinical research, this review describes the technology, its applications, a didactic step-by-step review of a typical microarray protocol, and a real experiment. Finally, it calls the attention of the medical community to the importance of integrating multidisciplinary teams to take advantage of this technology and its expanding applications that, in a slide, reveals our genetic inheritance and destiny.
Affiliation(s)
- Victor Trevino
- Instituto Tecnológico y de Estudios Superiores de Monterrey, Monterrey, Nuevo León, México
- School of Biosciences, University of Birmingham, Birmingham, United Kingdom
- Francesco Falciani
- School of Biosciences, University of Birmingham, Birmingham, United Kingdom
- Hugo A. Barrera-Saldaña
- Laboratorio de Genómica y Bioinformática del ULIEG. Departamento de Bioquímica, Facultad de Medicina de la Universidad Autónoma de Nuevo León. Monterrey, N.L. México
|
184
|
Bongard J, Lipson H. Automated reverse engineering of nonlinear dynamical systems. Proc Natl Acad Sci U S A 2007; 104:9943-8. [PMID: 17553966 PMCID: PMC1891254 DOI: 10.1073/pnas.0609476104] [Citation(s) in RCA: 169] [Impact Index Per Article: 9.9] [Received: 10/25/2006] [Indexed: 11/18/2022] Open
Abstract
Complex nonlinear dynamics arise in many fields of science and engineering, but uncovering the underlying differential equations directly from observations poses a challenging task. The ability to symbolically model complex networked systems is key to understanding them, an open problem in many disciplines. Here we introduce for the first time a method that can automatically generate symbolic equations for a nonlinear coupled dynamical system directly from time series data. This method is applicable to any system that can be described using sets of ordinary nonlinear differential equations, and assumes that the (possibly noisy) time series of all variables are observable. Previous automated symbolic modeling approaches of coupled physical systems produced linear models or required a nonlinear model to be provided manually. The advance presented here is made possible by allowing the method to model each (possibly coupled) variable separately, intelligently perturbing and destabilizing the system to extract its less observable characteristics, and automatically simplifying the equations during modeling. We demonstrate this method on four simulated and two real systems spanning mechanics, ecology, and systems biology. Unlike numerical models, symbolic models have explanatory value, suggesting that automated "reverse engineering" approaches for model-free symbolic nonlinear system identification may play an increasing role in our ability to understand progressively more complex systems in the future.
Affiliation(s)
- Josh Bongard
- Mechanical and Aerospace Engineering and Computing and Information Science, Cornell University, Ithaca, NY 14853, USA.
|
185
|
Redestig H, Repsilber D, Sohler F, Selbig J. Integrating functional knowledge during sample clustering for microarray data using unsupervised decision trees. Biom J 2007; 49:214-29. [PMID: 17476945 DOI: 10.1002/bimj.200610278] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
Clustering of microarray gene expression data is performed routinely, for genes as well as for samples. Clustering of genes can exhibit functional relationships between genes; clustering of samples on the other hand is important for finding e.g. disease subtypes, relevant patient groups for stratification or related treatments. Usually this is done by first filtering the genes for high-variance under the assumption that they carry most of the information needed for separating different sample groups. If this assumption is violated, important groupings in the data might be lost. Furthermore, classical clustering methods do not facilitate the biological interpretation of the results. Therefore, we propose to methodologically integrate the clustering algorithm with prior biological information. This is different from other approaches as knowledge about classes of genes can be directly used to ease the interpretation of the results and possibly boost clustering performance. Our approach computes dendrograms that resemble decision trees with gene classes used to split the data at each node which can help to find biologically meaningful differences between the sample groups. We have tested the proposed method both on simulated and real data and conclude its usefulness as a complementary method, especially when assumptions of few differentially expressed genes along with an informative mapping of genes to different classes are met.
Affiliation(s)
- Henning Redestig
- Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, 14476 Golm, Germany.
|
186
|
Do KA, McLachlan G, Bean R, Wen S. Application of gene shaving and mixture models to cluster microarray gene expression data. Cancer Inform 2007; 5:25-43. [PMID: 19390667 PMCID: PMC2666952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022] Open
Abstract
Researchers are frequently faced with the analysis of microarray data of a relatively large number of genes using a small number of tissue samples. We examine the application of two statistical methods for clustering such microarray expression data: EMMIX-GENE and GeneClust. EMMIX-GENE is a mixture-model based clustering approach, designed primarily to cluster tissue samples on the basis of the genes. GeneClust is an implementation of the gene shaving methodology, motivated by research to identify distinct sets of genes for which variation in expression could be related to a biological property of the tissue samples. We illustrate the use of these two methods in the analysis of Affymetrix oligonucleotide arrays of well-known data sets from colon tissue samples with and without tumors, and of tumor tissue samples from patients with leukemia. Although the two approaches have been developed from different perspectives, the results demonstrate a clear correspondence between gene clusters produced by GeneClust and EMMIX-GENE for the colon tissue data. It is demonstrated, for the case of ribosomal proteins and smooth muscle genes in the colon data set, that both methods can classify genes into co-regulated families. It is further demonstrated that tissue types (tumor and normal) can be separated on the basis of subtle distributed patterns of genes. Application to the leukemia tissue data produces a division of tissues corresponding closely to the external classification, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), for both methods. In addition, we also identify genes specific for the subgroup of ALL-Tcell samples. 
Overall, we find that the gene shaving method produces gene clusters at great speed; allows variable cluster sizes and can incorporate partial or full supervision; and finds clusters of genes in which the gene expression varies greatly over the tissue samples while maintaining a high level of coherence between the gene expression profiles. The intent of the EMMIX-GENE method is to cluster the tissue samples. It performs a filtering step that results in a subset of relevant genes, followed by gene clustering, and then tissue clustering, and is favorable in its accuracy of ranking the clusters produced.
Affiliation(s)
- K-A. Do
- University of Texas, M.D. Anderson Cancer Center, Houston, Texas, U.S.A. Correspondence: K-A. Do, Tel: 713-794-4155; Fax: 713-563-4242
- G.J. McLachlan
- Department of Mathematics & Institute for Molecular Bioscience, University of Queensland, Brisbane, 4072, Australia
- R. Bean
- Department of Mathematics & Institute for Molecular Bioscience, University of Queensland, Brisbane, 4072, Australia
- S. Wen
- University of Texas, M.D. Anderson Cancer Center, Houston, Texas, U.S.A.
|
187
|
Belacel N, Wang Q, Cuperlovic-Culf M. Clustering methods for microarray gene expression data. OMICS: A Journal of Integrative Biology 2007; 10:507-31. [PMID: 17233561 DOI: 10.1089/omi.2006.10.507] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022]
Abstract
Within the field of genomics, microarray technologies have become a powerful technique for simultaneously monitoring the expression patterns of thousands of genes under different sets of conditions. A main task now is to propose analytical methods to identify groups of genes that manifest similar expression patterns and are activated by similar conditions. The corresponding analysis problem is to cluster multi-condition gene expression data. The purpose of this paper is to present a general view of clustering techniques used in microarray gene expression data analysis.
Affiliation(s)
- Nabil Belacel
- National Research Council Canada, Institute for Information Technology, Scientific Park, Moncton, New Brunswick, Canada.
|
188
|
Raveh B, Rahat O, Basri R, Schreiber G. Rediscovering secondary structures as network motifs--an unsupervised learning approach. Bioinformatics 2007; 23:e163-9. [PMID: 17237086 DOI: 10.1093/bioinformatics/btl290] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Secondary structures are key descriptors of a protein fold and its topology. In recent years, they facilitated intensive computational tasks for finding structural homologues, fold prediction and protein design. Their popularity stems from an appealing regularity in patterns of geometry and chemistry. However, the definition of secondary structures is of subjective nature. An unsupervised de-novo discovery of these structures would shed light on their nature, and improve the way we use these structures in algorithms of structural bioinformatics. METHODS We developed a new method for unsupervised partitioning of undirected graphs, based on patterns of small recurring network motifs. Our input was the network of all H-bonds and covalent interactions of protein backbones. This method can be also used for other biological and non-biological networks. RESULTS In a fully unsupervised manner, and without assuming any explicit prior knowledge, we were able to rediscover the existence of conventional alpha-helices, parallel beta-sheets, anti-parallel sheets and loops, as well as various non-conventional hybrid structures. The relation between connectivity and crystallographic temperature factors establishes the existence of novel secondary structures.
Affiliation(s)
- Barak Raveh
- Department of Computer Science & Applied Mathematics, Weizmann Institute of Science, Rehovot, 76100, Israel.
|
189
|
Yoon S, De Micheli G. An application of zero-suppressed binary decision diagrams to clustering analysis of DNA microarray data. Conference Proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2007; 2004:2925-8. [PMID: 17270890 DOI: 10.1109/iembs.2004.1403831] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
Clustering has been one of the most popular techniques to analyze gene expression data. The biclustering method is two-dimensional clustering of genes and experimental conditions to identify a group of genes that display a coherent behavior in some conditions. Although this method may provide additional insight overlooked by traditional clustering techniques, it is often computationally expensive to perform biclustering on practical gene expression data. In this work, we propose a novel biclustering technique that exploits the zero-suppressed binary decision diagrams (ZBDDs) to cope with such a computational challenge. The ZBDDs are a variant of the reduced ordered binary decision diagrams that have found a widespread use in optimization and verification of VLSI digital circuits. Our experimental results demonstrate that the ZBDDs can indeed extend the scalability of our biclustering algorithm substantially, thus enabling us to apply it to a wider spectrum of gene expression data.
|
190
|
Martin S, Zhang Z, Martino A, Faulon JL. Boolean dynamics of genetic regulatory networks inferred from microarray time series data. Bioinformatics 2007; 23:866-74. [PMID: 17267426 DOI: 10.1093/bioinformatics/btm021] [Citation(s) in RCA: 124] [Impact Index Per Article: 7.3] [Indexed: 11/12/2022]
Abstract
MOTIVATION Methods available for the inference of genetic regulatory networks strive to produce a single network, usually by optimizing some quantity to fit the experimental observations. In this article we investigate the possibility that multiple networks can be inferred, all resulting in similar dynamics. This idea is motivated by theoretical work which suggests that biological networks are robust and adaptable to change, and that the overall behavior of a genetic regulatory network might be captured in terms of dynamical basins of attraction. RESULTS We have developed and implemented a method for inferring genetic regulatory networks for time series microarray data. Our method first clusters and discretizes the gene expression data using k-means and support vector regression. We then enumerate Boolean activation-inhibition networks to match the discretized data. Finally, the dynamics of the Boolean networks are examined. We have tested our method on two immunology microarray datasets: an IL-2-stimulated T cell response dataset and a LPS-stimulated macrophage response dataset. In both cases, we discovered that many networks matched the data, and that most of these networks had similar dynamics. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
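The idea of comparing networks by their dynamics can be made concrete with a minimal sketch that enumerates the attractors and basin sizes of a Boolean network under synchronous update. The three-gene network, its update rules, and the function names below are hypothetical illustrations, not the networks or the code used in the paper.

```python
from itertools import product

def step(state):
    """Synchronous update of a hypothetical 3-gene activation-inhibition network."""
    a, b, c = state
    return (
        int(not c),    # gene A is inhibited by C
        a,             # gene B is activated by A
        int(a and b),  # gene C needs both A and B
    )

def find_attractors(update, n_genes):
    """Map each attractor (a frozenset of states) to the number of start states reaching it."""
    basins = {}
    for start in product((0, 1), repeat=n_genes):
        seen = []
        state = start
        # Follow the trajectory until a state repeats; the repeat marks the cycle.
        while state not in seen:
            seen.append(state)
            state = update(state)
        cycle = frozenset(seen[seen.index(state):])
        basins[cycle] = basins.get(cycle, 0) + 1
    return basins

attractors = find_attractors(step, 3)
for cycle, size in attractors.items():
    print(sorted(cycle), "basin size:", size)
```

Two candidate networks that fit the same discretized data could then be compared by whether they produce the same attractors with similar basin sizes. Exhaustive enumeration visits all 2^n states, so this sketch is only practical for small networks.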
Affiliation(s)
- Shawn Martin
- Sandia National Laboratories, Computational Biology Department, PO Box 5800, Albuquerque, NM 87185-1316, USA
|
191
|
Li H, Chen X, Zhang K, Jiang T. A general framework for biclustering gene expression data. J Bioinform Comput Biol 2007; 4:911-33. [PMID: 17007074 DOI: 10.1142/s021972000600217x] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 12/21/2005] [Revised: 04/21/2006] [Accepted: 04/22/2006] [Indexed: 11/18/2022]
Abstract
A large number of biclustering methods have been proposed to detect patterns in gene expression data. All these methods try to find some type of biclusters but no one can discover all the types of patterns in the data. Furthermore, researchers have to design new algorithms in order to find new types of biclusters/patterns that interest biologists. In this paper, we propose a novel approach for biclustering that, in general, can be used to discover all computable patterns in gene expression data. The method is based on the theory of Kolmogorov complexity. More precisely, we use Kolmogorov complexity to measure the randomness of submatrices as the merit of biclusters because randomness naturally consists in a lack of regularity, which is a common property of all types of patterns. On the basis of algorithmic probability measure, we develop a Markov Chain Monte Carlo algorithm to search for biclusters. Our method can also be easily extended to solve the problems of conventional clustering and checkerboard type biclustering. The preliminary experiments on simulated as well as real data show that our approach is very versatile and promising.
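Kolmogorov complexity itself is not computable, so any practical score has to rely on an approximation. The sketch below swaps in compressed length (via zlib) as a rough stand-in, which is an assumption of this example and not necessarily the estimator used by the authors. The toy data, block sizes, and function name are likewise hypothetical.

```python
import random
import zlib

def compressed_size(submatrix):
    """Bytes zlib needs to encode a discretized submatrix (rows of small ints, 0-255)."""
    raw = bytes(value for row in submatrix for value in row)
    return len(zlib.compress(raw, 9))

rng = random.Random(0)
n_rows, n_cols = 20, 20

# A patterned block: every row repeats the same column profile.
profile = [rng.randrange(8) for _ in range(n_cols)]
patterned = [list(profile) for _ in range(n_rows)]

# A block of independent random levels with no shared structure.
noise = [[rng.randrange(8) for _ in range(n_cols)] for _ in range(n_rows)]

print("patterned block:", compressed_size(patterned), "bytes")
print("random block:   ", compressed_size(noise), "bytes")
```

A block with regular structure compresses to fewer bytes than a random block of the same shape, which is the sense in which low complexity signals a pattern. A search procedure such as the Markov Chain Monte Carlo algorithm mentioned in the abstract would then favor submatrices with low scores.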
Affiliation(s)
- Haifeng Li
- Center of Excellence in Genomic Science, University of Southern California, Los Angeles, CA 90089, USA.
|
192
|
Aslan H, Ravid-Amir O, Clancy BM, Rezvankhah S, Pittman D, Pelled G, Turgeman G, Zilberman Y, Gazit Z, Hoffmann A, Gross G, Domany E, Gazit D. Advanced molecular profiling in vivo detects novel function of dickkopf-3 in the regulation of bone formation. J Bone Miner Res 2006; 21:1935-45. [PMID: 17002559 DOI: 10.1359/jbmr.060819] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Indexed: 11/18/2022]
Abstract
UNLABELLED A bioinformatics-based analysis of endochondral bone formation model detected several genes upregulated in this process. Among these genes the dickkopf homolog 3 (Dkk3) was upregulated and further studies showed that its expression affects in vitro and in vivo osteogenesis. This study indicates a possible role of Dkk3 in regulating bone formation. INTRODUCTION Endochondral bone formation is a complex biological process involving numerous chondrogenic, osteogenic, and angiogenic proteins, only some of which have been well studied. Additional key genes may have important roles as well. We hypothesized that to identify key genes and signaling pathways crucial for bone formation, a comprehensive gene discovery strategy should be applied to an established in vivo model of osteogenesis. MATERIALS AND METHODS We used in vivo implanted C3H10T1/2 cells that had been genetically engineered to express human bone morphogenetic protein-2 (BMP2) in a tetracycline-regulated system that controls osteogenic differentiation. Oligonucleotide microarray data from the implants (n = 4 repeats) was analyzed using coupled two-way clustering (CTWC) and statistical methods. For studying the effects of dickkopf homolog 3 (Dkk3) in chondrogenesis and osteogenesis, C3H10T1/2 mesenchymal progenitors were used. RESULTS The CTWC revealed temporal expression of Dkk3 with other chondrogenesis-, osteogenesis-, and Wnt-related genes. Quantitative RT-PCR confirmed the expression of Dkk3 in the implants. C3H10T1/2 cells that expressed Dkk3 in the presence of BMP2 displayed lower levels of alkaline phosphatase and collagen I mRNA expression than control C3H10T1/2 cells that did not express Dkk3. Interestingly, the levels of collagen II mRNA expression, Alcian blue staining, and glucose aminoglycans (GAGs) production were not influenced by Dkk3 expression. 
In vivo microCT and bioluminescence imaging revealed that co-expression of Dkk3 and BMP2 by implanted C3H10T1/2 cells induced the formation of significantly lower quantities of bone than cells expressing only BMP2. CONCLUSIONS A bioinformatics analysis enabled the identification of Dkk3 as a pivotal gene with a novel function in endochondral bone formation. Our results showed that Dkk3 might have inhibitory effects on osteogenesis, but no effect on chondrogenesis, indicating that Dkk3 plays a regulatory role in endochondral bone formation. Further mechanistic studies are required to reveal the mechanism of action of Dkk3 in endochondral bone formation.
Affiliation(s)
- Hadi Aslan
- Skeletal Biotechnology Laboratory, Hebrew University-Hadassah Medical Center, Jerusalem, Israel
|
193
|
Ma X, Lee H, Wang L, Sun F. CGI: a new approach for prioritizing genes by combining gene expression and protein–protein interaction data. Bioinformatics 2006; 23:215-21. [PMID: 17098772 DOI: 10.1093/bioinformatics/btl569] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.9] [Indexed: 01/21/2023] Open
Abstract
MOTIVATION Identifying candidate genes associated with a given phenotype or trait is an important problem in biological and biomedical studies. Prioritizing genes based on the accumulated information from several data sources is of fundamental importance. Several integrative methods have been developed when a set of candidate genes for the phenotype is available. However, how to prioritize genes for phenotypes when no candidates are available is still a challenging problem. RESULTS We develop a new method for prioritizing genes associated with a phenotype by Combining Gene expression and protein Interaction data (CGI). The method is applied to yeast gene expression data sets in combination with protein interaction data sets of varying reliability. We found that our method outperforms the intuitive prioritizing method of using either gene expression data or protein interaction data only and a recent gene ranking algorithm GeneRank. We then apply our method to prioritize genes for Alzheimer's disease. AVAILABILITY The code in this paper is available upon request.
Affiliation(s)
- Xiaotu Ma
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-2910, USA
|
194
|
Abstract
MOTIVATIONS Bi-clustering is an important approach in microarray data analysis. The underlying bases for using bi-clustering in the analysis of gene expression data are (1) similar genes may exhibit similar behaviors only under a subset of conditions, not all conditions, (2) genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another. Using bi-clustering algorithms, one can obtain sets of genes that are co-regulated under subsets of conditions. RESULTS We develop a polynomial time algorithm to find an optimal bi-cluster with the maximum similarity score. To our knowledge, this is the first formulation for bi-cluster problems that admits a polynomial time algorithm for optimal solutions. The algorithm works for a special case, where the bi-clusters are approximately squares. We then extend the algorithm to handle various kinds of other cases. Experiments on simulation data and real data show that the new algorithms outperform most of the existing methods in many cases. Our new algorithms have the following advantages: (1) no discretization procedure is required, (2) performs well for overlapping bi-clusters and (3) works well for additive bi-clusters. AVAILABILITY The software is available at http://www.cs.cityu.edu.hk/~liuxw/msbe/help.html.
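To illustrate what an additive bi-cluster looks like, the sketch below builds a block whose entries are a row effect plus a column effect and scores it with the mean squared residue of Cheng and Church. That score is a standard coherence measure swapped in for illustration. It is not the similarity score defined in this paper, and the toy values and function name are hypothetical.

```python
def mean_squared_residue(block):
    """Cheng-Church mean squared residue; zero for a perfectly additive block."""
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(block[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = block[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)

# Additive pattern: entry = base + row effect + column effect.
additive = [[1 + r + c for c in (0, 2, 5)] for r in (0, 1, 4)]

# The same block with one entry disturbed, which breaks additivity.
perturbed = [row[:] for row in additive]
perturbed[1][1] += 3

print("additive :", mean_squared_residue(additive))
print("perturbed:", mean_squared_residue(perturbed))
```

Rows in the additive block differ only by a constant shift, so no discretization is needed to recognize the pattern, while a single disturbed entry raises the score above zero.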
Affiliation(s)
- Xiaowen Liu
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
|
195
|
Ehrenreich A. DNA microarray technology for the microbiologist: an overview. Appl Microbiol Biotechnol 2006; 73:255-73. [PMID: 17043830 DOI: 10.1007/s00253-006-0584-2] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.0] [Received: 07/11/2006] [Revised: 07/11/2006] [Accepted: 07/11/2006] [Indexed: 10/24/2022]
Abstract
DNA microarrays have found widespread use as a flexible tool to investigate bacterial metabolism. Their main advantage is the comprehensive data they produce on the transcriptional response of the whole genome to an environmental or genetic stimulus. This allows the microbiologist to monitor metabolism and to define stimulons and regulons. Other fields of application are the identification of microorganisms or the comparison of genomes. The importance of this technology increases with the number of sequenced genomes and the falling prices for equipment and oligonucleotides. Knowledge of DNA microarrays is of rising relevance for many areas in microbiological research. Much literature has been published on various specific aspects of this technique that can be daunting to the casual user and beginner. This article offers a comprehensive outline of microarray technology for transcription analysis in microbiology. It briefly discusses the types of DNA microarrays available, the printing of custom arrays, common labeling strategies for targets, hybridization, scanning, normalization, and clustering of expression data.
Affiliation(s)
- Armin Ehrenreich
- Institute of Microbiology and Genetics, Georg August University, 37077 Göttingen, Germany.
|
196
|
|
197
|
Grothaus GA, Mufti A, Murali TM. Automatic layout and visualization of biclusters. Algorithms Mol Biol 2006; 1:15. [PMID: 16952321 PMCID: PMC1624833 DOI: 10.1186/1748-7188-1-15] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 07/20/2006] [Accepted: 09/04/2006] [Indexed: 11/17/2022] Open
Abstract
Background Biclustering has emerged as a powerful algorithmic tool for analyzing measurements of gene expression. A number of different methods have emerged for computing biclusters in gene expression data. Many of these algorithms may output a very large number of biclusters with varying degrees of overlap. There are no systematic methods that create a two-dimensional layout of the computed biclusters and display overlaps between them. Results We develop a novel algorithm for laying out biclusters in a two-dimensional matrix whose rows (respectively, columns) are rows (respectively, columns) of the original dataset. We display each bicluster as a contiguous submatrix in the layout. We allow the layout to have repeated rows and/or columns from the original matrix as required, but we seek a layout of the smallest size. We also develop a web-based search interface for the user to query the genes and samples of interest and visualise the layout of biclusters matching the queries. Conclusion We demonstrate the usefulness of our approach on gene expression data for two types of leukaemia and on protein-DNA binding data for two growth conditions in Saccharomyces cerevisiae. The software implementing the layout algorithm is available at .
Affiliation(s)
- Gregory A Grothaus
- Department of Computer Science, 660 McBryde Hall, Virginia Polytechnic Institute and State University, Blacksburg VA 24061, USA
- Google Inc., 1600 Amphitheater Parkway, Mountain View CA 94043, USA
- Adeel Mufti
- Department of Computer Science, 660 McBryde Hall, Virginia Polytechnic Institute and State University, Blacksburg VA 24061, USA
- TM Murali
- Department of Computer Science, 660 McBryde Hall, Virginia Polytechnic Institute and State University, Blacksburg VA 24061, USA
|
198
|
Teng L, Chan LW. Biclustering Gene Expression Profiles by Alternately Sorting with Weighted Correlated Coefficient. 2006. [DOI: 10.1109/mlsp.2006.275563] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/08/2022]
|
199
|
Beattie BJ, Robinson PN. Binary state pattern clustering: a digital paradigm for class and biomarker discovery in gene microarray studies of cancer. J Comput Biol 2006; 13:1114-30. [PMID: 16796554 DOI: 10.1089/cmb.2006.13.1114] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Abstract
Class and biomarker discovery continue to be among the preeminent goals in gene microarray studies of cancer. We have developed a new data mining technique, which we call Binary State Pattern Clustering (BSPC) that is specifically adapted for these purposes, with cancer and other categorical datasets. BSPC is capable of uncovering statistically significant sample subclasses and associated marker genes in a completely unsupervised manner. This is accomplished through the application of a digital paradigm, where the expression level of each potential marker gene is treated as being representative of its discrete functional state. Multiple genes that divide samples into states along the same boundaries form a kind of gene-cluster that has an associated sample-cluster. BSPC is an extremely fast deterministic algorithm that scales well to large datasets. Here we describe results of its application to three publicly available oligonucleotide microarray datasets. Using an alpha-level of 0.05, clusters reproducing many of the known sample classifications were identified along with associated biomarkers. In addition, a number of simulations were conducted using shuffled versions of each of the original datasets, noise-added datasets, as well as completely artificial datasets. The robustness of BSPC was compared to that of three other publicly available clustering methods: ISIS, CTWC and SAMBA. The simulations demonstrate BSPC's substantially greater noise tolerance and confirm the accuracy of our calculations of statistical significance.
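The "digital paradigm" described above can be sketched in a few lines: each gene is reduced to a binary state per sample, and genes that split the samples along the same boundary are grouped together. This is only a rough illustration. The median threshold, the handling of complementary patterns, the toy data, and all function names are assumptions of this sketch, and the significance testing that BSPC performs is omitted.

```python
from collections import defaultdict
from statistics import median

def binarize(values):
    """Give each sample a 0/1 state by thresholding at the gene's median (an assumed rule)."""
    cut = median(values)
    return tuple(int(v > cut) for v in values)

def canonical(pattern):
    """Treat a pattern and its complement as the same split of the samples."""
    flipped = tuple(1 - bit for bit in pattern)
    return min(pattern, flipped)

def group_genes(expression):
    """Group gene names by the sample partition that their binary states induce."""
    groups = defaultdict(list)
    for gene, values in expression.items():
        groups[canonical(binarize(values))].append(gene)
    return dict(groups)

# Hypothetical toy data: 4 genes measured on 6 samples.
expression = {
    "g1": [0.1, 0.2, 0.1, 2.0, 2.2, 1.9],
    "g2": [5.0, 5.1, 4.9, 9.0, 9.3, 8.8],
    "g3": [3.0, 3.1, 2.9, 0.5, 0.4, 0.6],  # opposite direction, same boundary
    "g4": [1.0, 4.0, 1.1, 4.2, 0.9, 4.1],  # splits the samples differently
}
groups = group_genes(expression)
for pattern, genes in groups.items():
    print(pattern, genes)
```

Each key is a sample split (the sample-cluster) and each value is the set of genes that support it (the gene-cluster). Exact pattern matching is deliberately strict here, so a real analysis would need some tolerance for noise.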
Affiliation(s)
- Bradley J Beattie
- Department of Neurology, Memorial Sloan-Kettering Cancer Center, New York, New York 10021, USA.
|
200
|
Bidaut G, Manion FJ, Garcia C, Ochs MF. WaveRead: automatic measurement of relative gene expression levels from microarrays using wavelet analysis. J Biomed Inform 2006; 39:379-88. [PMID: 16298556 DOI: 10.1016/j.jbi.2005.10.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 03/02/2005] [Revised: 08/30/2005] [Accepted: 10/11/2005] [Indexed: 10/25/2022]
Abstract
Gene expression microarrays monitor the expression levels of thousands of genes in an experiment simultaneously. To utilize the information generated, each of the thousands of spots on a microarray image must be properly quantified, including background correction. Most present methods require manual alignment of grids to the image data, and still often require additional minor adjustments on a spot by spot basis to correct for spotting irregularities. Such intervention is time consuming and also introduces inconsistency in the handling of data. A fully automatic, tested system would increase throughput and reliability in this field. In this paper, we describe WaveRead, a fully automated, standalone, open-source system for quantifying gene expression array images. Through the use of wavelet analysis to identify the spot locations and diameters, the system is able to automatically grid the image and quantify signal intensities and background corrections without any user intervention. The ability of WaveRead to perform proper quantification is demonstrated by analysis of both simulated images containing spots with donut shapes, elliptical shapes, and Gaussian intensity distributions, as well as of standard images from the National Cancer Institute.
Affiliation(s)
- Ghislain Bidaut
- Bioinformatics, Division of Population Science, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111-2497, USA
|