51
|
van Delft JHM, van Agen E, van Breda SGJ, Herwijnen MH, Staal YCM, Kleinjans JCS. Comparison of supervised clustering methods to discriminate genotoxic from non-genotoxic carcinogens by gene expression profiling. Mutat Res 2005; 575:17-33. [PMID: 15924884 DOI: 10.1016/j.mrfmmm.2005.02.006] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Received: 10/04/2004] [Revised: 02/17/2005] [Accepted: 02/23/2005] [Indexed: 05/02/2023]
Abstract
Prediction of the toxic properties of chemicals based on modulation of gene expression profiles in exposed cells or animals is one of the major applications of toxicogenomics. Previously, we demonstrated that by Pearson correlation analysis of gene expression profiles from treated HepG2 cells it is possible to correctly discriminate and predict genotoxic from non-genotoxic carcinogens. Since to date many different supervised clustering methods for discrimination and prediction tests are available, we investigated whether application of the methods provided by the Whitehead Institute and Stanford University improved our initial prediction. Four different supervised clustering methods were applied for this comparison, namely Pearson correlation analysis (Pearson), nearest shrunken centroids analysis (NSC), K-nearest neighbour analysis (KNN) and Weighted voting (WV). For each supervised clustering method, three different approaches were followed: (1) using all the data points for all treatments, (2) exclusion of the samples with marginally affected gene expression profiles and (3) filtering out the gene expression signals that were hardly altered. On the complete data set, NSC, KNN and WV outperformed the Pearson test, but on the reduced data sets no clear difference was observed. Exclusion of samples with marginally affected profiles improved the prediction by all methods. For the various prediction models, gene sets of different compositions were selected; in these 27 genes appeared three times or more. These 27 genes are involved in many different biological processes and molecular functions, such as apoptosis, cell cycle control, regulation of transcription, and transporter activity, many of them related to the carcinogenic process. One gene, BAX, was selected in all 10 models, while ZFP36 was selected in 9, and AHR, MT1E and TTR in 8. 
Summarising, this study demonstrates that several supervised clustering methods can be used to discriminate certain genotoxic from non-genotoxic carcinogens by gene expression profiling in vitro in HepG2 cells. None of the methods clearly outperforms the others.
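The Pearson-based discrimination mentioned in this abstract can be illustrated with a minimal sketch. It assumes a nearest-centroid reading of the approach: each class is summarised by its mean expression profile, and a new sample is assigned to the class whose centroid it correlates with most strongly. The function names and the toy expression values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def centroid(profiles):
    """Gene-wise mean of a list of expression profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def classify(profile, training):
    """Return the label whose class centroid correlates best with the profile."""
    centroids = {label: centroid(ps) for label, ps in training.items()}
    return max(centroids, key=lambda label: pearson(profile, centroids[label]))


# Hypothetical log-ratio profiles over four genes (illustrative values only).
training = {
    "genotoxic": [[2.0, 1.5, -0.5, 0.1], [1.8, 1.2, -0.3, 0.0]],
    "non-genotoxic": [[-0.2, 0.1, 1.4, 1.9], [0.0, -0.1, 1.1, 2.2]],
}
predicted = classify([1.5, 1.0, -0.2, 0.2], training)
print(predicted)
```

Correlation ignores the overall magnitude of a profile, so a weakly affected sample can still match a class by shape alone; this may be one reason the abstract reports that excluding marginally affected samples improved prediction.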
Affiliation(s)
- J H M van Delft
- Department of Health Risk Analysis and Toxicology, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
|
52
|
Abstract
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Affiliation(s)
- Rui Xu
- Department of Electrical and Computer Engineering, University of Missouri-Rolla, Rolla, MO 65409, USA.
|
53
|
Au WH, Chan KCC, Wong AKC, Wang Y. Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005; 2:83-101. [PMID: 17044174 DOI: 10.1109/tcbb.2005.17] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.0] [Indexed: 05/12/2023]
Abstract
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are typically developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case, the likelihood of reporting patterns that are actually irrelevant due to chances becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology to solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. The grouping of genes based on attribute interdependence within group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification. 
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find the meaningful clusters of genes. By selecting a subset of genes which have high multiple-interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with very high classification rate. From the pool, gene expressions of different categories can be identified.
Affiliation(s)
- Wai-Ho Au
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
|
54
|
Figueroa A, Borneman J, Jiang T. Clustering binary fingerprint vectors with missing values for DNA array data analysis. J Comput Biol 2005; 11:887-901. [PMID: 15700408 DOI: 10.1089/cmb.2004.11.887] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 10/26/2022] Open
Abstract
Oligonucleotide fingerprinting is a powerful DNA array-based method to characterize cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression profiling and DNA clone classification. We are especially interested in the latter application. A key step in the method is the cluster analysis of fingerprint data obtained from DNA array hybridization experiments. Most of the existing approaches to clustering use (normalized) real intensity values and thus do not treat positive and negative hybridization signals equally (positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint data are first normalized and binarized using control DNA clones. Because there may exist unresolved (or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously. We study the computational complexity of this clustering problem and a natural parameterized version and present an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs considered here, which allow us to efficiently find the maximum cliques as well as some special maximal cliques. Our preliminary experimental results on simulated and real data demonstrate that the algorithm runs faster and performs better than some popular hierarchical and graph-based clustering methods. The results on real data from DNA clone classification also suggest that this discrete approach is more accurate than clustering methods based on real intensity values in terms of separating clones that have different characteristics with respect to the given oligonucleotide probes.
Affiliation(s)
- Andres Figueroa
- Department of Computer Science, University of California, Riverside 92521, USA.
|
55
|
Mocellin S, Provenzano M, Rossi CR, Pilati P, Nitti D, Lise M. DNA array-based gene profiling: from surgical specimen to the molecular portrait of cancer. Ann Surg 2005; 241:16-26. [PMID: 15621987 PMCID: PMC1356842 DOI: 10.1097/01.sla.0000150157.83537.53] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Indexed: 11/26/2022]
Abstract
Cancer is a heterogeneous disease in most respects, including its cellularity, different genetic alterations, and diverse clinical behaviors. Traditional molecular analyses are reductionist, assessing only 1 or a few genes at a time, thus working with a biologic model too specific and limited to confront a process whose clinical outcome is likely to be governed by the combined influence of many genes. The potential of functional genomics is enormous, because for each experiment, thousands of relevant observations can be made simultaneously. Accordingly, DNA array, like other high-throughput technologies, might catalyze and ultimately accelerate the development of knowledge in tumor cell biology. Although in its infancy, the implementation of DNA array technology in cancer research has already provided investigators with novel data and intriguing new hypotheses on the molecular cascade leading to carcinogenesis, tumor aggressiveness, and sensitivity to antiblastic agents. Given the revolutionary implications that the use of this technology might have in the clinical management of patients with cancer, principles of DNA array-based tumor gene profiling need to be clearly understood for the data to be correctly interpreted and appreciated. In the present work, we discuss the technical features characterizing this powerful laboratory tool and review the applications so far described in the field of oncology.
Affiliation(s)
- Simone Mocellin
- Surgery Branch, Department of Oncological and Surgical Sciences, University of Padova, Italy.
|
56
|
Ferrazzi F, Magni P, Bellazzi R. Random Walk Models for Bayesian Clustering of Gene Expression Profiles. Appl Bioinformatics 2005; 4:263-76. [PMID: 16309344 DOI: 10.2165/00822942-200504040-00006] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/02/2022]
Abstract
The analysis of gene expression temporal profiles is a topic of increasing interest in functional genomics. Model-based clustering methods are particularly interesting because they are able to capture the dynamic nature of these data and to identify the optimal number of clusters. We have defined a new Bayesian method that allows us to cope with some important issues that remain unsolved in the currently available approaches: the presence of time dislocations in gene expression, the non-stationarity of the processes generating the data, and the presence of data collected on an irregular temporal grid. Our method, which is based on random walk models, requires only mild a priori assumptions about the nature of the processes generating the data and explicitly models inter-gene variability within each cluster. It has first been validated on simulated datasets and then employed for the analysis of a dataset relative to serum-stimulated fibroblasts. In all cases, the results have been promising, showing that the method can be helpful in functional genomics research.
Affiliation(s)
- Fulvia Ferrazzi
- Dipartimento di Informatica e Sistemistica, Università di Pavia, Pavia, Italy
|
57
|
Clustering Gene Expression Series with Prior Knowledge. Lecture Notes in Computer Science 2005. [DOI: 10.1007/11557067_3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 01/07/2023]
|
58
|
Adjaye J. Whole-genome approaches for large-scale gene identification and expression analysis in mammalian preimplantation embryos. Reprod Fertil Dev 2005; 17:37-45. [PMID: 15745630 DOI: 10.1071/rd04075] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 08/01/2004] [Accepted: 10/01/2004] [Indexed: 11/23/2022] Open
Abstract
Understanding the molecular basis of transcriptional control during preimplantation development is of utmost importance if we are to intervene and eliminate or reduce abnormalities associated with growth, disease and infertility by applying assisted reproduction. Importantly, these studies should enhance our knowledge of basic reproductive biology and its application to regenerative medicine and livestock production. A major obstacle impeding progress in these areas is the difficulty of generating molecular portraits of preimplantation embryos from their minute amounts of RNA. The present review describes the various approaches whereby classical embryology fuses with molecular biology, high-throughput genomics and systems biology to address and solve questions related to early development in mammals.
Affiliation(s)
- James Adjaye
- Max Planck Institute for Molecular Genetics, Department of Vertebrate Genomics, Ihnestrasse 73, D-14195 Berlin, Germany.
|
59
|
Liu Y, Navathe SB, Civera J, Dasigi V, Ram A, Ciliax BJ, Dingledine R. Text mining biomedical literature for discovering gene-to-gene relationships: a comparative study of algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005; 2:62-76. [PMID: 17044165 DOI: 10.1109/tcbb.2005.14] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 05/12/2023]
Abstract
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
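The cluster quality measures named in this abstract (purity and entropy) can be sketched as follows. The abstract does not give formulas, so the definitions below are the commonly used ones and are an assumption: purity is the fraction of items carrying the majority label of their cluster, and entropy is the size-weighted mean label entropy per cluster. The toy cluster assignments and labels are hypothetical.

```python
from collections import Counter
from math import log2


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    majority_total = 0
    for cluster in set(clusters):
        members = [lab for c, lab in zip(clusters, labels) if c == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted mean of the label entropy (in bits) within each cluster."""
    total = len(labels)
    result = 0.0
    for cluster in set(clusters):
        members = [lab for c, lab in zip(clusters, labels) if c == cluster]
        size = len(members)
        h = -sum((n / size) * log2(n / size) for n in Counter(members).values())
        result += (size / total) * h
    return result


# Hypothetical example: six genes, two clusters, two known functional groups.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, labels), entropy(clusters, labels))
```

Higher purity and lower entropy both indicate clusters that agree better with the known groups, which matches the direction of the comparison reported in the abstract.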
Affiliation(s)
- Ying Liu
- College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30322, USA.
|
60
|
Wang Y, Makedon FS, Ford JC, Pearlman J. HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 2004; 21:1530-7. [PMID: 15585531 DOI: 10.1093/bioinformatics/bti192] [Citation(s) in RCA: 132] [Impact Index Per Article: 6.6] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Recent studies have shown that microarray gene expression data are useful for phenotype classification of many diseases. A major problem in this classification is that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. Many approaches have been proposed for this gene selection problem. Most of the previous gene ranking methods typically select 50-200 top-ranked genes and these genes are often highly correlated. Our goal is to select a small set of non-redundant marker genes that are most relevant for the classification task. RESULTS: To achieve this goal, we developed a novel hybrid approach that combines gene ranking and clustering analysis. In this approach, we first applied feature filtering algorithms to select a set of top-ranked genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram was analyzed by a sweep-line algorithm and marker genes were selected by collapsing dense clusters. An empirical study using three public datasets shows that our approach is capable of selecting relatively few marker genes while offering the same or better leave-one-out cross-validation accuracy compared with approaches that use top-ranked genes directly for classification. AVAILABILITY: The HykGene software is freely available at http://www.cs.dartmouth.edu/~wyh/software.htm CONTACT: wyh@cs.dartmouth.edu SUPPLEMENTARY INFORMATION: Supplementary material is available from http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm.
Affiliation(s)
- Yuhang Wang
- Department of Computer Science, Dartmouth College, 6211 Sudikoff Laboratory, Hanover, NH 03755-3510, USA.
|
61
|
Illiger J, Herwig R, Steinfath M, Przewieslik T, Elge T, Bull C, Radelof U, Lehrach H, Janitz M. Establishment of T cell-specific and natural killer cell-specific unigene sets: towards high-throughput genomics of leukaemia. European Journal of Immunogenetics: Official Journal of the British Society for Histocompatibility and Immunogenetics 2004; 31:253-7. [PMID: 15548262 DOI: 10.1111/j.1365-2370.2004.00483.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2023]
Abstract
We report the establishment of highly non-redundant unigene sets consisting of cDNA clones derived from T lymphocytes and natural killer cells. The sets consist of 10 506 and 13 409 clones, respectively, arrayed on nylon membranes in duplicate. The sets provide an excellent tool for genome-wide gene expression analysis studies in immunology research.
Affiliation(s)
- J Illiger
- Max Planck Institute for Molecular Genetics, Berlin, Germany
|
62
|
Yoo C, Cooper GF. An evaluation of a system that recommends microarray experiments to perform to discover gene-regulation pathways. Artif Intell Med 2004; 31:169-82. [PMID: 15219293 DOI: 10.1016/j.artmed.2004.01.018] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 03/10/2003] [Revised: 04/14/2003] [Accepted: 01/16/2004] [Indexed: 11/23/2022]
Abstract
The main topic of this paper is modeling the expected value of experimentation (EVE) for discovering causal pathways in gene expression data. By experimentation we mean both interventions (e.g., a gene knockout experiment) and observations (e.g., passively observing the expression level of a "wild-type" gene). We introduce a system called GEEVE (causal discovery in Gene Expression data using Expected Value of Experimentation), which implements expected value of experimentation in discovering causal pathways using gene expression data. GEEVE provides the following assistance, which is intended to help biologists in their quest to discover gene-regulation pathways: (1) recommending which experiments to perform (with a focus on "knockout" experiments) using an expected value of experimentation method; (2) recommending the number of measurements (observational and experimental) to include in the experimental design, again using an EVE method; and (3) providing a Bayesian analysis that combines prior knowledge with recent microarray experimental results to derive posterior probabilities of gene regulation relationships. In recommending which experiments to perform (and how many times to repeat them), the EVE approach considers the biologist's preferences regarding which genes the discovery process should focus on. Also, since exact EVE calculations are exponential in time, GEEVE incorporates approximation methods. GEEVE is able to combine data from knockout experiments with data from wild-type experiments to suggest additional experiments to perform and then to analyze the results of those microarray experiments. It models the possibility that unmeasured (latent) variables may be responsible for some of the statistical associations among the expression levels of the genes under study. To evaluate the GEEVE system, we used a gene expression simulator to generate data from specified models of gene regulation.
The results show that the GEEVE system gives better results than two recently published approaches (1) in learning the generating models of gene regulation and (2) in recommending experiments to perform.
Affiliation(s)
- Changwon Yoo
- 420 Social Science, University of Montana, Missoula, MT 59812, USA.
|
63
|
Balasubramaniyan R, Hüllermeier E, Weskamp N, Kämper J. Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics 2004; 21:1069-77. [PMID: 15513997 DOI: 10.1093/bioinformatics/bti095] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.1] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION: Microarray technology enables the study of gene expression on a large scale. The application of methods for data analysis then allows for grouping genes that show a similar expression profile and that are thus likely to be co-regulated. A relationship among genes at the biological level often presents itself by locally similar and potentially time-shifted patterns in their expression profiles. RESULTS: Here, we propose a new method (CLARITY; Clustering with Local shApe-based similaRITY) for the analysis of microarray time course experiments that uses a local shape-based similarity measure based on Spearman rank correlation. This measure does not require a normalization of the expression data and is comparably robust towards noise. It is also able to detect similar and even time-shifted sub-profiles. To this end, we implemented an approach motivated by the BLAST algorithm for sequence alignment. We used CLARITY to cluster the time series of gene expression data during the mitotic cell cycle of the yeast Saccharomyces cerevisiae. The obtained clusters were related to the MIPS functional classification to assess their biological significance. We found that several clusters were significantly enriched with genes that share similar or related functions.
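The idea of a local, possibly time-shifted similarity based on Spearman rank correlation can be illustrated with a brute-force sketch. This is not the CLARITY algorithm, which the abstract describes as a BLAST-motivated approach; it simply compares every pair of fixed-length windows from two profiles and reports the best-correlated pair. The function names, the window length and the toy profiles are assumptions for illustration.

```python
from math import sqrt


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant window carries no shape information
    return cov / sqrt(vx * vy)


def best_local_match(a, b, window):
    """Return (rho, start_a, start_b) for the best-correlated window pair."""
    best = (-2.0, 0, 0)
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            rho = spearman(a[i:i + window], b[j:j + window])
            if rho > best[0]:
                best = (rho, i, j)
    return best


# Hypothetical time courses: b repeats the early pattern of a two steps later.
a = [0, 1, 3, 2, 0, 0, 0]
b = [0, 0, 0, 2, 6, 4, 0]
rho, start_a, start_b = best_local_match(a, b, window=4)
print(rho, start_b - start_a)
```

Because only ranks enter the calculation, the differing amplitudes of the two toy profiles do not affect the score, which is consistent with the abstract's remark that the measure does not require normalization.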
Affiliation(s)
- Rajarajeswari Balasubramaniyan
- Max-Planck Institute for Terrestrial Microbiology, Department of Organismic Interactions Karl-von-Frisch-Strasse, 35043 Marburg, Germany
|
64
|
Desper R, Khan J, Schäffer AA. Tumor classification using phylogenetic methods on expression data. J Theor Biol 2004; 228:477-96. [PMID: 15178197 DOI: 10.1016/j.jtbi.2004.02.021] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 03/19/2003] [Revised: 02/03/2004] [Accepted: 02/20/2004] [Indexed: 10/26/2022]
Abstract
Tumor classification is a well-studied problem in the field of bioinformatics. Developments in the field of DNA chip design have now made it possible to measure the expression levels of thousands of genes in sample tissue from healthy cell lines or tumors. A number of studies have examined the problems of tumor classification: class discovery, the problem of defining a number of classes of tumors using the data from a DNA chip, and class prediction, the problem of accurately classifying an unknown tumor, given expression data from the unknown tumor and from a learning set. The current work has applied phylogenetic methods to both problems. To solve the class discovery problem, we impose a metric on a set of tumors as a function of their gene expression levels, and impose a tree structure on this metric, using standard tree fitting methods borrowed from the field of phylogenetics. Phylogenetic methods provide a simple way of imposing a clear hierarchical relationship on the data, with branch lengths in the classification tree representing the degree of separation witnessed. We tested our method for class discovery on two data sets: a data set of 87 tissues, comprised mostly of small, round, blue-cell tumors (SRBCTs), and a data set of 22 breast tumors. We fit the 87 samples of the first set to a classification tree, which neatly separated into four major clusters corresponding exactly to the four groups of tumors, namely neuroblastomas, rhabdomyosarcomas, Burkitt's lymphomas, and the Ewing's family of tumors. The classification tree built using the breast cancer data separated tumors with BRCA1 mutations from those with BRCA2 mutations, with sporadic tumors separated from both groups and from each other. We also demonstrate the flexibility of the class discovery method with regard to standard resampling methodology such as jackknifing and noise perturbation. 
To solve the class prediction problem, we built a classification tree on the learning set, and then sought the optimal placement of each test sample within the classification tree. We tested this method on the SRBCT data set, and classified each tumor successfully.
Affiliation(s)
- Richard Desper
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bldg. 38A, Room 8N805, 8600 Rockville Pike, Bethesda, MD 20894, USA.
|
65
|
Daub CO, Steuer R, Selbig J, Kloska S. Estimating mutual information using B-spline functions--an improved similarity measure for analysing gene expression data. BMC Bioinformatics 2004; 5:118. [PMID: 15339346 PMCID: PMC516800 DOI: 10.1186/1471-2105-5-118] [Citation(s) in RCA: 194] [Impact Index Per Article: 9.7] [Received: 12/15/2003] [Accepted: 08/31/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. In the context of the clustering of genes with similar patterns of expression it has been suggested as a general quantity of similarity to extend commonly used linear measures. Since mutual information is defined in terms of discrete variables, its application to continuous data requires the use of binning procedures, which can lead to significant numerical errors for datasets of small or moderate size. RESULTS: In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: the significance, as a measure of the power of distinction from random correlation, is significantly increased. This concept is subsequently illustrated on two large-scale gene expression datasets and the results are compared to those obtained using other similarity measures. C++ source code for our algorithm is available for non-commercial use from kloska@scienion.de upon request. CONCLUSION: The utilisation of mutual information as a similarity measure enables the detection of non-linear correlations in gene expression datasets. Frequently applied linear correlation measures, which are often used on an ad-hoc basis without further justification, are thereby extended.
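The binning procedure that this abstract identifies as a source of numerical error can be sketched as a plain histogram estimator of mutual information. This is the simple baseline, not the B-spline estimator the paper proposes; the equal-width bins, the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] to one of `bins` equal-width bins."""
    if hi == lo:
        return 0
    index = int((value - lo) / (hi - lo) * bins)
    return min(index, bins - 1)  # the maximum falls into the last bin


def binned_mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information, in bits."""
    n = len(x)
    lo_x, hi_x, lo_y, hi_y = min(x), max(x), min(y), max(y)
    bx = [bin_index(v, lo_x, hi_x, bins) for v in x]
    by = [bin_index(v, lo_y, hi_y, bins) for v in y]
    count_x, count_y = Counter(bx), Counter(by)
    count_xy = Counter(zip(bx, by))
    mi = 0.0
    for (i, j), c in count_xy.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[i] * count_y[j]))
    return mi


# Hypothetical data: y is a deterministic function of x, so MI is maximal.
x = list(range(8))
y = list(range(8))
print(binned_mutual_information(x, y, bins=4))
```

Each data point contributes to exactly one bin here, so with few samples the estimate depends strongly on where the bin boundaries fall; this is the kind of small-sample sensitivity the abstract refers to.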
Affiliation(s)
- Carsten O Daub
- Max Planck Institute of Molecular Plant Physiology, Potsdam, 14424, Germany
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, 17177, Sweden
- Ralf Steuer
- Nonlinear Dynamics Group, Institute of Physics, University of Potsdam, Potsdam, 14415, Germany
- Joachim Selbig
- Max Planck Institute of Molecular Plant Physiology, Potsdam, 14424, Germany
- Sebastian Kloska
- Max Planck Institute of Molecular Plant Physiology, Potsdam, 14424, Germany
- Scienion AG, Volmerstrasse 7a, Berlin, 12489, Germany
|
66
|
Büssow K, Quedenau C, Sievert V, Tischer J, Scheich C, Seitz H, Hieke B, Niesen FH, Götz F, Harttig U, Lehrach H. A catalog of human cDNA expression clones and its application to structural genomics. Genome Biol 2004; 5:R71. [PMID: 15345055 PMCID: PMC522878 DOI: 10.1186/gb-2004-5-9-r71] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 04/16/2004] [Revised: 07/21/2004] [Accepted: 07/23/2004] [Indexed: 11/10/2022] Open
Abstract
We describe here a systematic approach to the identification of human proteins and protein fragments that can be expressed as soluble proteins in Escherichia coli. A cDNA expression library of 10,825 clones was screened by small-scale expression and purification and 2,746 clones were identified. Sequence and protein-expression data were entered into a public database. A set of 163 clones was selected for structural analysis and 17 proteins were prepared for crystallization, leading to three new structures.
Affiliation(s)
- Konrad Büssow
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Claudia Quedenau
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Volker Sievert
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Janett Tischer
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Christoph Scheich
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Harald Seitz
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Brigitte Hieke
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Frank H Niesen
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Institute of Medical Physics and Biophysics, Charité Medical School, Ziegelstraße 5/9, 10117 Berlin, Germany
- Frank Götz
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Alpha Bioverfahrenstechnik GmbH, Heinrich-Hertz-Straße 1b, 14532 Kleinmachnow, Germany
- Ulrich Harttig
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - RZPD German Resource Center for Genome Research GmbH, Heubnerweg 6, 14059 Berlin, Germany
- Hans Lehrach
  - Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
|
67
|
Barra V. Analysis of gene expression data using functional principal components. Comput Methods Programs Biomed 2004; 75:1-9. [PMID: 15158042 DOI: 10.1016/j.cmpb.2003.08.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 12/23/2002] [Revised: 08/28/2003] [Accepted: 08/28/2003] [Indexed: 05/24/2023]
Abstract
The large amount of data involved in DNA microarrays implies the development of efficient computer algorithms to analyze the gene expressions, and thus to study the transcriptome. Numerous techniques already exist and we propose a new method based on the key idea that gene profiles may be considered as continuous curves. The analysis of the set of curves stemming from the DNA microarray may be then performed using a functional analysis which can exhibit the main modes of variations in this set, gather genes with similar variations and extract characteristic parameters of gene profiles. We aim here at introducing this method, called the Functional Principal Component Analysis. A prospective study has been performed on two available datasets, concerning on the one hand the sporulation data of the Saccharomyces cerevisiae, and on the other hand data of tumor cell lines. Results are very promising: the method is able to extract characteristic parameters from the datasets, to extract significant modes of variations in the set of gene profiles, and to link these variations to biological processes already studied in literature.
Affiliation(s)
- Vincent Barra
- LIMOS, UMR CNRS 6158, Campus des Cézeaux, Aubiere 63117, France.
|
68
|
Tsai HK, Yang JM, Tsai YF, Kao CY. An evolutionary approach for gene expression patterns. IEEE Trans Inf Technol Biomed 2004; 8:69-78. [PMID: 15217251 DOI: 10.1109/titb.2004.826713] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
This study presents an evolutionary algorithm, called a heterogeneous selection genetic algorithm (HeSGA), for analyzing the patterns of gene expression on microarray data. Microarray technologies have provided the means to monitor the expression levels of a large number of genes simultaneously. Gene clustering and gene ordering are important in analyzing a large body of microarray expression data. The proposed method simultaneously solves gene clustering and gene-ordering problems by integrating global and local search mechanisms. Clustering and ordering information is used to identify functionally related genes and to infer genetic networks from immense microarray expression data. HeSGA was tested on eight test microarray datasets, ranging in size from 147 to 6221 genes. The experimental clustering and visual results indicate that HeSGA not only ordered genes smoothly but also grouped genes with similar gene expressions. Visualized results and a new scoring function that references predefined functional categories were employed to confirm the biological interpretations of results yielded using HeSGA and other methods. These results indicate that HeSGA has potential in analyzing gene expression patterns.
Affiliation(s)
- Huai-Kuang Tsai
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, ROC.
|
69
|
Poustka AJ, Groth D, Hennig S, Thamm S, Cameron A, Beck A, Reinhardt R, Herwig R, Panopoulou G, Lehrach H. Generation, annotation, evolutionary analysis, and database integration of 20,000 unique sea urchin EST clusters. Genome Res 2004; 13:2736-46. [PMID: 14656975 PMCID: PMC403816 DOI: 10.1101/gr.1674103] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Indexed: 11/24/2022]
Abstract
Together with the hemichordates, sea urchins represent basal groups of nonchordate invertebrate deuterostomes that occupy a key position in bilaterian evolution. Because sea urchin embryos are also amenable to functional studies, the sea urchin system has emerged as one of the leading models for the analysis of the function of genomic regulatory networks that control development. We have analyzed a total of 107,283 cDNA clones of libraries that span the development of the sea urchin Strongylocentrotus purpuratus. Normalization by oligonucleotide fingerprinting, EST sequencing and sequence clustering resulted in an EST catalog comprised of 20,000 unique genes or gene fragments. Around 7000 of the unique EST consensus sequences were associated with molecular and developmental functions. Phylogenetic comparison of the identified genes to the genome of the urochordate Ciona intestinalis indicates that at least one quarter of the genes thought to be chordate specific were already present at the base of deuterostome evolution. Comparison of the number of gene copies in sea urchins to those in chordates and vertebrates indicates that the sea urchin genome has not undergone extensive gene or complete genome duplications. The established unique gene set represents an essential tool for the annotation and assembly of the forthcoming sea urchin genome sequence. All cDNA clones and filters of all analyzed libraries are available from the resource center of the German genome project at http://www.rzpd.de.
Affiliation(s)
- Albert J Poustka
- Evolution and Development Group, Max Planck Institute for Molecular Genetics, Department of Vertebrate Genomics, 14195 Berlin, Germany.
|
70
|
Jiang D, Pei J, Zhang A. Towards interactive exploration of gene expression patterns. ACTA ACUST UNITED AC 2003. [DOI: 10.1145/980972.980983] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 10/23/2022]
Abstract
Analyzing coherent gene expression patterns is an important task in bioinformatics research and biomedical applications. Recently, various clustering methods have been adapted or proposed to identify clusters of co-expressed genes and recognize coherent expression patterns as the centroids of the clusters. However, the interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge, which presents several challenges for coherent pattern mining and cannot be solved by most existing clustering approaches. In this paper, we introduce an interactive exploration system GeneX (Gene eXplorer) for mining coherent expression patterns. We develop a novel coherent pattern index graph to provide highly confident indications of the existence of coherent patterns. Typical exploration operations are supported based on the index graph. We also provide a bunch of graphical views as the user interface to visualize the data set and facilitate the interactive operations. To help users to interpret and validate the mining results, we design the gene annotation panel that connects the genes with some public annotation databases. The experimental results show that our approach is more effective than the state-of-the-art methods in mining real gene expression data sets.
Affiliation(s)
- Jian Pei
- State University of New York at Buffalo
|
71
|
Xu D, Olman V, Wang L, Xu Y. EXCAVATOR: a computer program for efficiently mining gene expression data. Nucleic Acids Res 2003; 31:5582-9. [PMID: 14500821 PMCID: PMC206478 DOI: 10.1093/nar/gkg783] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 06/22/2003] [Revised: 08/01/2003] [Accepted: 08/18/2003] [Indexed: 11/14/2022] Open
Abstract
Massive amounts of gene expression data are generated using microarrays for functional studies of genes and gene expression data clustering is a useful tool for studying the functional relationship among genes in a biological process. We have developed a computer package EXCAVATOR for clustering gene expression profiles based on our new framework for representing gene expression data as a minimum spanning tree. EXCAVATOR uses a number of rigorous and efficient clustering algorithms. This program has a number of unique features, including capabilities for: (i) data-constrained clustering; (ii) identification of genes with similar expression profiles to pre-specified seed genes; (iii) cluster identification from a noisy background; (iv) computational comparison between different clustering results of the same data set. EXCAVATOR can be run from a Unix/Linux/DOS shell, from a Java interface or from a Web server. The clustering results can be visualized as colored figures and 2-dimensional plots. Moreover, EXCAVATOR provides a wide range of options for data formats, distance measures, objective functions, clustering algorithms, methods to choose number of clusters, etc. The effectiveness of EXCAVATOR has been demonstrated on several experimental data sets. Its performance compares favorably against the popular K-means clustering method in terms of clustering quality and computing time.
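The general idea of clustering on a minimum spanning tree can be sketched as below. This is a minimal illustration, not EXCAVATOR's code: it assumes Euclidean distance and the simplest possible objective (cut the k-1 longest tree edges), whereas the abstract mentions several distance measures and objective functions. The function name `mst_clusters` is hypothetical.

```python
import math


def mst_clusters(points, k):
    """Split points into k clusters by cutting the k-1 longest MST edges.

    Illustrative sketch only; requires Python 3.8+ for math.dist.
    Returns one integer label per point, numbered in order of first use.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Prim's algorithm on the complete Euclidean graph, O(n^2).
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []              # (weight, u, v) for each MST edge
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u

    # Keeping the n-k shortest of the n-1 tree edges leaves k components.
    edges.sort()
    kept = edges[: n - k]

    # Label the connected components with a small union-find.
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Each point here would be one gene's expression profile across conditions. Cutting the longest edges is sensitive to outliers, which is one reason a real package would offer other objective functions.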
Affiliation(s)
- Dong Xu
- Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480, USA.
|
72
|
Katagiri F, Glazebrook J. Local Context Finder (LCF) reveals multidimensional relationships among mRNA expression profiles of Arabidopsis responding to pathogen infection. Proc Natl Acad Sci U S A 2003; 100:10842-7. [PMID: 12960373 PMCID: PMC196890 DOI: 10.1073/pnas.1934349100] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022] Open
Abstract
A major task in computational analysis of mRNA expression profiles is definition of relationships among profiles on the basis of similarities among them. This is generally achieved by pattern recognition in the distribution of data points representing each profile in a high-dimensional space. Some drawbacks of commonly used pattern recognition algorithms stem from their use of a globally linear space and/or limited degrees of freedom. A pattern recognition method called Local Context Finder (LCF) is described here. LCF uses nonlinear dimensionality reduction for pattern recognition. Then it builds a network of profiles based on the nonlinear dimensionality reduction results. LCF was used to analyze mRNA expression profiles of the plant host Arabidopsis interacting with the bacterial pathogen Pseudomonas syringae. In one case, LCF revealed two dimensions essential to explain the effects of the NahG transgene and the ndr1 mutation on resistant and susceptible responses. In another case, plant mutants deficient in responses to pathogen infection were classified on the basis of LCF analysis of their profiles. The classification by LCF was consistent with the results of biological characterization of the mutants. Thus, LCF is a powerful method for extracting information from expression profile data.
Affiliation(s)
- Fumiaki Katagiri
- Torrey Mesa Research Institute, Syngenta Research and Technology, 3115 Merryfield Row, San Diego, CA 92121, USA.
|
73
|
Kato N, Kobayashi T, Honda H. Screening of stress enhancer based on analysis of gene expression profiles: enhancement of hyperthermia-induced tumor necrosis by an MMP-3 inhibitor. Cancer Sci 2003; 94:644-9. [PMID: 12841876 PMCID: PMC11160297 DOI: 10.1111/j.1349-7006.2003.tb01497.x] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Received: 03/11/2003] [Revised: 05/01/2003] [Accepted: 05/06/2003] [Indexed: 11/29/2022] Open
Abstract
To improve the therapeutic benefit of hyperthermia, we examined changes of global gene expression after heat shock using DNA microarrays consisting of 12 814 clones. HeLa cells were treated for 1 h at 44 °C and RNA was extracted from the cells 0, 3, 6, and 12 h after heat shock. The 664 genes that were up- or down-regulated after heat shock were classified into 7 clusters using fuzzy adaptive resonance theory (fuzzy ART). There were 41 genes in two clusters that were induced in the early phase after heat shock. In addition to shock response genes, such as hsp70 and hsp40, the stress response genes c-jun, c-fos and egr-1 were expressed in the early phase after heat shock. We also found that expression of matrix metalloproteinase 3 (MMP-3) was enhanced during the early response. We therefore investigated the role of MMP-3 in the heat shock response by examining HeLa cell survival after heat treatment in the presence and absence of an MMP-3 inhibitor, N-isobutyl-N-(4-methoxyphenylsulfonyl)glycylhydroxamic acid (NNGH) or N-hydroxy-2(R)-[[4-methoxysulfonyl](3-picolyl)amino]-3-methylbutaneamide hydrochloride (MMI270). The number of surviving cells 3 days after heat treatment significantly decreased, reaching 3.5% for NNGH and 0.2% for MMI270. These results indicate that the MMP-3 inhibitors enhanced heat shock-induced cell death and behaved as stress enhancers in cancer cells. This valuable conclusion was reached as a direct result of the gene expression profiling that was performed in these studies.
Affiliation(s)
- Naoki Kato
- Department of Biotechnology, School of Engineering, Nagoya University, Chikusa-ku, Nagoya 464-8603, Japan
|
74
|
Panopoulou G, Hennig S, Groth D, Krause A, Poustka AJ, Herwig R, Vingron M, Lehrach H. New evidence for genome-wide duplications at the origin of vertebrates using an amphioxus gene set and completed animal genomes. Genome Res 2003; 13:1056-66. [PMID: 12799346 PMCID: PMC403660 DOI: 10.1101/gr.874803] [Citation(s) in RCA: 129] [Impact Index Per Article: 6.1] [Indexed: 01/28/2023]
Abstract
The 2R hypothesis predicting two genome duplications at the origin of vertebrates is highly controversial. Studies published so far include limited sequence data from organisms close to the hypothesized genome duplications. Through the comparison of a gene catalog from amphioxus, the closest living invertebrate relative of vertebrates, to 3453 single-copy genes orthologous between Caenorhabditis elegans (C), Drosophila melanogaster (D), and Saccharomyces cerevisiae (Y), and to Ciona intestinalis ESTs, mouse, and human genes, we show with a large number of genes that the gene duplication activity is significantly higher after the separation of amphioxus and the vertebrate lineages, which we estimate at 650 million years (Myr). The majority of human orthologs of 195 CDY groups that could be dated by the molecular clock appear to be duplicated between 300 and 680 Myr with a mean at 488 million years ago (Mya). We detected 485 duplicated chromosomal segments in the human genome containing CDY orthologs, 331 of which are found duplicated in the mouse genome and within regions syntenic between human and mouse, indicating that these were generated earlier than the human-mouse split. Model based calculations of the codon substitution rate of the human genes included in these segments agree with the molecular clock duplication time-scale prediction. Our results favor at least one large duplication event at the origin of vertebrates, followed by smaller scale duplication closer to the bird-mammalian split.
Affiliation(s)
- Georgia Panopoulou
- Evolution and Development Group, Department Professor H. Lehrach, Max-Planck Institut für Molekulare Genetik, D-14195 Berlin, Germany.
|
75
|
Bicciato S, Pandin M, Didonè G, Di Bello C. Pattern identification and classification in gene expression data using an autoassociative neural network model. Biotechnol Bioeng 2003; 81:594-606. [PMID: 12514809 DOI: 10.1002/bit.10505] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.0] [Indexed: 11/09/2022]
Abstract
The application of DNA microarray technology for analysis of gene expression creates enormous opportunities to accelerate the pace in understanding living systems and identification of target genes and pathways for drug development and therapeutic intervention. Parallel monitoring of the expression profiles of thousands of genes seems particularly promising for a deeper understanding of cancer biology and the identification of molecular signatures supporting the histological classification schemes of neoplastic specimens. However, the increasing volume of data generated by microarray experiments poses the challenge of developing equally efficient methods and analysis procedures to extract, interpret, and upgrade the information content of these databases. Herein, a computational procedure for pattern identification, feature extraction, and classification of gene expression data through the analysis of an autoassociative neural network model is described. The identified patterns and features contain critical information about gene-phenotype relationships observed during changes in cell physiology. They represent a rational and dimensionally reduced base for understanding the basic biology of the onset of diseases, defining targets of therapeutic intervention, and developing diagnostic tools for the identification and classification of pathological states. The proposed method has been tested on two different microarray datasets: Golub's analysis of acute human leukemia [Golub et al. (1999) Science 286:531-537], and the human colon adenocarcinoma study presented by Alon et al. [1999; Proc Natl Acad Sci USA 97:10101-10106]. The analysis of the neural network internal structure allows the identification of specific phenotype markers and the extraction of peculiar associations among genes and physiological states. At the same time, the neural network outputs provide assignment to multiple classes, such as different pathological conditions or tissue samples, for previously unseen instances.
Affiliation(s)
- Silvio Bicciato
- Department of Chemical Process Engineering, University of Padova, via Marzolo, 9, 35131, Padova, Italy.
|
76
|
Peterson LE. Partitioning large-sample microarray-based gene expression profiles using principal components analysis. Comput Methods Programs Biomed 2003; 70:107-119. [PMID: 12507787 DOI: 10.1016/s0169-2607(02)00009-3] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.4] [Indexed: 05/24/2023]
Abstract
Principal components analysis (PCA) is useful for reproducing the total variation among hundreds or thousands of continuously-scaled variables with a much smaller number of unobservable variables called 'latent factors'. The CLUSFAVOR computer program was used to implement PCA for identifying groups of genes with similar expression profiles from a large number of genes used on DNA microarrays. This paper describes the principal components solution to the factor model of the correlation matrix R, calculation of eigenvalues and eigenvectors of R, extraction of factors, and calculation of factor loadings and identification of genes with similar loading patterns to construct groups of genes with similar expression profiles. With regard to extraction of factors, it was found that more than 90% of the total variance in input data could be accounted for by extracting factors whose eigenvalues exceed unity. Bipolar factors containing strong positive and negative loadings can also be used for identifying two unique groups of genes, since expression profiles of genes that load positive are unlike expression profiles of genes that load negative on the same factor. While PCA does not provide the absolute answer to a multidimensional problem, it nevertheless can provide a heuristic with which natural groupings of genes with similar expression profiles can be assembled. While cluster analysis essentially generates a single dendrogram (tree branch) containing every gene in the input data, PCA can be used to assemble gene expression profiles that strongly correlate with the latent factors accounting for a majority of total variance. Example results for CLUSFAVOR computer program runs are provided.
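The steps named in the abstract (eigen-decomposition of R, factor extraction, loadings) can be written out as a short worked sketch. The symbols p, V, Λ, λ_k, ℓ_jk and m are notation assumed here for illustration and do not come from the abstract; only the correlation matrix R and the eigenvalue-greater-than-one rule do.

```latex
% Principal components solution for a p x p correlation matrix R
R = V \Lambda V^{\top}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \qquad
\lambda_1 \ge \dots \ge \lambda_p \ge 0, \qquad
\sum_{k=1}^{p} \lambda_k = \operatorname{tr}(R) = p .

% Loading of variable j on factor k
\ell_{jk} = v_{jk} \sqrt{\lambda_k} .

% Retain the factors whose eigenvalues exceed unity
m = \#\{\, k : \lambda_k > 1 \,\}, \qquad
\text{fraction of variance explained} = \frac{1}{p} \sum_{k=1}^{m} \lambda_k .

% Two-variable example with correlation r = 0.8
R = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}, \qquad
\lambda_1 = 1.8, \quad \lambda_2 = 0.2, \qquad
m = 1, \qquad \frac{\lambda_1}{p} = \frac{1.8}{2} = 0.9 .
```

In the two-variable example both variables load positively on the retained factor. A bipolar factor in the sense of the abstract would instead carry loadings of opposite sign on the same factor.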
Affiliation(s)
- Leif E Peterson
- Department of Medicine, Baylor College of Medicine, One Baylor Plaza ST-924, Houston, TX 77030, USA.
|
77
|
Abstract
A common approach to the analysis of gene expression data is to define clusters of genes that have similar expression. A critical step in cluster analysis is the determination of similarity between the expression levels of two genes. We introduce a neural network-based similarity index as a non-linear similarity index and compare the results with other proximity measures for Saccharomyces cerevisiae gene expression data. We show that the clusters obtained using Euclidean distance, correlation coefficients, and mutual information were not significantly different. The clusters formed with the neural network-based index were more in agreement with those defined by functional categories and common regulatory motifs.
Affiliation(s)
- Tomohiro Sawa
- Division of Health Sciences and Technology, Harvard Medical School and Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.
|
78
|
Sharan R, Elkon R, Shamir R. Cluster analysis and its applications to gene expression data. Ernst Schering Res Found Workshop 2002:83-108. [PMID: 12061008 DOI: 10.1007/978-3-662-04747-7_5] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]
Affiliation(s)
- R Sharan
- School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
|
79
|
Herwig R, Schulz B, Weisshaar B, Hennig S, Steinfath M, Drungowski M, Stahl D, Wruck W, Menze A, O'Brien J, Lehrach H, Radelof U. Construction of a 'unigene' cDNA clone set by oligonucleotide fingerprinting allows access to 25 000 potential sugar beet genes. Plant J 2002; 32:845-57. [PMID: 12472698 DOI: 10.1046/j.1365-313x.2002.01457.x] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 05/23/2023]
Abstract
Access to the complete gene inventory of an organism is crucial to understanding physiological processes like development, differentiation, pathogenesis, or adaptation to the environment. Transcripts from many active genes are present at low copy numbers. Therefore, procedures that rely on random EST sequencing or on normalisation and subtraction methods have to produce massively redundant data to get access to low-abundance genes. Here, we present an improved oligonucleotide fingerprinting (ofp) approach to the genome of sugar beet (Beta vulgaris), a plant for which practically no molecular information has been available. To identify distinct genes and to provide a representative 'unigene' cDNA set for sugar beet, 159 936 cDNA clones were processed utilizing large-scale, high-throughput data generation and analysis methods. Data analysis yielded 30 444 ofp clusters reflecting the number of different genes in the original cDNA sample. A sample of 10 961 cDNA clones, each representing a different cluster, were selected for sequencing. Standard sequence analysis confirmed that 89% of these EST sequences did represent different genes. These results indicate that the full set of 30 444 ofp clusters represent up to 25 000 genes. We conclude that the ofp analysis pipeline is an accurate and effective way to construct large representative 'unigene' sets for any plant of interest with no requirement for prior molecular sequence data.
Affiliation(s)
- Ralf Herwig
- Max-Planck Institute for Molecular Genetics, Ihnestr. 73, D-14195 Berlin, Germany.
|
80
|
Fuchs T, Malecova B, Linhart C, Sharan R, Khen M, Herwig R, Shmulevich D, Elkon R, Steinfath M, O'Brien JK, Radelof U, Lehrach H, Lancet D, Shamir R. DEFOG: a practical scheme for deciphering families of genes. Genomics 2002; 80:295-302. [PMID: 12213199 DOI: 10.1006/geno.2002.6830] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.8] [Indexed: 11/22/2022]
Abstract
We developed a novel efficient scheme, DEFOG (for "deciphering families of genes"), for determining sequences of numerous genes from a family of interest. The scheme provides a powerful means to obtain a gene family composition in species for which high-throughput genomic sequencing data are not available. DEFOG uses two key procedures. The first is a novel algorithm for designing highly degenerate primers based on a set of known genes from the family of interest. These primers are used in PCR reactions to amplify the members of the gene family. The second combines oligofingerprinting of the cloned PCR products with clustering of the clones based on their fingerprints. By selecting members from each cluster, a low-redundancy clone subset is chosen for sequencing. We applied the scheme to the human olfactory receptor (OR) genes. OR genes constitute the largest gene superfamily in the human genome, as well as in the genomes of other vertebrate species. DEFOG almost tripled the size of the initial repertoire of human ORs in a single experiment, and only 7% of the PCR clones had to be sequenced. Extremely high degeneracies, reaching over a billion combinations of distinct PCR primer pairs, proved to be very effective and yielded only 0.4% nonspecific products.
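To make the notion of primer degeneracy concrete, the sketch below counts how many distinct sequences a degenerate primer written in standard IUPAC nucleotide codes stands for, and multiplies the counts for a primer pair. It illustrates the quantity only and is not the DEFOG primer-design algorithm; the function names and the example primers in the usage note are hypothetical.

```python
from math import prod

# Number of concrete bases each IUPAC nucleotide code can stand for.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def primer_degeneracy(primer):
    """Number of distinct sequences encoded by one degenerate primer."""
    return prod(IUPAC_CHOICES[base] for base in primer.upper())


def pair_degeneracy(forward, reverse):
    """Number of distinct forward/reverse primer combinations."""
    return primer_degeneracy(forward) * primer_degeneracy(reverse)
```

A made-up primer such as `NNRY` encodes 4 × 4 × 2 × 2 = 64 sequences. Because the count is multiplicative across positions and across the two primers, a pair with a few dozen degenerate positions can exceed the billion combinations mentioned in the abstract.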
Affiliation(s)
- Tania Fuchs
- Department of Molecular Genetics and the Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100, Israel
|
81
|
Herrero J, Dopazo J. Combining hierarchical clustering and self-organizing maps for exploratory analysis of gene expression patterns. J Proteome Res 2002; 1:467-70. [PMID: 12645919 DOI: 10.1021/pr025521v] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.0] [Indexed: 02/08/2023]
Abstract
Self-organizing maps (SOM) constitute an alternative to classical clustering methods because of their linear run times and superior performance in dealing with noisy data. Nevertheless, the clustering obtained with SOM is dependent on the relative sizes of the clusters. Here, we show how the combination of SOM with hierarchical clustering methods constitutes an excellent tool for exploratory analysis of massive data like DNA microarray expression patterns.
Affiliation(s)
- Javier Herrero
- Bioinformatics Unit, Spanish National Cancer Center (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
|
82
|
Ramaswamy S, Nakamura N, Sansal I, Bergeron L, Sellers WR. A novel mechanism of gene regulation and tumor suppression by the transcription factor FKHR. Cancer Cell 2002; 2:81-91. [PMID: 12150827 DOI: 10.1016/s1535-6108(02)00086-7] [Citation(s) in RCA: 333] [Impact Index Per Article: 15.1] [Indexed: 12/14/2022]
Abstract
The mammalian DAF-16-like transcription factors, FKHR, FKHRL1, and AFX, function as key regulators of insulin signaling, cell cycle progression, and apoptosis downstream of phosphoinositide 3-kinase. Gene activation through binding to insulin response sequences (IRS) has been thought to be essential for mediating these functions. However, using transcriptional profiling, chromatin immunoprecipitation, and functional experiments, we demonstrate that rather than activation of IRS regulated genes (Class I transcripts), transcriptional repression of D-type cyclins (in Class III) is required for FKHR mediated inhibition of cell cycle progression and transformation. These data suggest that a novel mechanism of FKHR-mediated gene regulation is linked to its activity as a suppressor of tumor growth.
Affiliation(s)
- Shivapriya Ramaswamy
- Department of Adult Oncology and Department of Internal Medicine, Dana-Farber Cancer Institute and Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA
|
83
|
Abstract
The work presented here attempts to consolidate our knowledge on cellular transcriptome and proteome. It takes into account that a typical activated cell (lymphocyte) contains 40 000 mRNA molecules at any time, and it represents about 5000 different molecular species of transcripts. Such a cell has about 1 000 000 000 protein molecules, some of them being present at 10 000 000 copies while others at a very low copy number (say 1 to 10 copies per cell). By studying cell-free expression of individual cDNA clones (or pools of known complexity) we address those rare molecular components that will remain undetected by the current analytical means. For our analysis we use cell-free translation systems (wheat germ or rabbit reticulocyte origin) and we study polypeptide products originating from intact, or restriction endonuclease-treated cDNA clones. We conclude that in most instances expressed genes yield transcript(s) that translate into several, and often very numerous families of polypeptide species. In our ISODALT two-dimensional gel system we characterize the proteomic profile of the clonal polypeptide families in terms of their molecular mass, charge, multiple products, and appearance.
Collapse
|
84
|
Li MD, Konu O, Kane JK, Becker KG. Microarray technology and its application on nicotine research. Mol Neurobiol 2002; 25:265-85. [PMID: 12109875 DOI: 10.1385/mn:25:3:265] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.5] [Indexed: 11/11/2022]
Abstract
Since its development, the microarray technique has revolutionized almost all fields of biomedical research by enabling high-throughput gene expression profiling. Using cDNA microarrays, thousands of genes from various organisms have been examined with respect to differentiation/development, disease diagnosis, and drug discovery. Nevertheless, research on nicotine using cDNA microarrays has been rather limited. Therefore, it is our intention in this article to report the findings of our cDNA microarray study on nicotine. We first present an overview of the microarray technology, particularly focusing on the factors related to microarray design and analysis. Second, we provide a detailed description of several biological pathways newly identified in our laboratory, such as phosphatidylinositol signaling and calcium homeostasis, which are involved in the response to chronic nicotine administration. Additionally, we illustrate how comparisons between microarray studies help identify candidate genes that potentially may explain the observed inverse association between smoking and schizophrenia. Lastly, given the early stage of microarray research on nicotine, we elaborate on the need for an efficient analysis of genetic networks to further enhance our understanding of the mechanisms involved in nicotine abuse and addiction.
Affiliation(s)
- Ming D Li
- Department of Pharmacology, University of Tennessee College of Medicine, Memphis 38163, USA.
|
85
|
|
86
|
Hess KR, Zhang W, Baggerly KA, Stivers DN, Coombes KR. Microarrays: handling the deluge of data and extracting reliable information. Trends Biotechnol 2001; 19:463-8. [PMID: 11602311 DOI: 10.1016/s0167-7799(01)01792-9] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.3] [Indexed: 02/04/2023]
Abstract
Application of powerful, high-throughput genomics technologies is becoming more common and these technologies are evolving at a rapid pace. Genomics facilities are being established in major research institutions to produce inexpensive, customized cDNA microarrays that are accessible to researchers in a broad range of fields. These high-throughput platforms have generated a massive onslaught of data, which threatens to overwhelm researchers. Although microarrays show great promise, the technology has not matured to the point of consistently generating robust and reliable data when used in the average laboratory. This article addresses several aspects related to the handling of the deluge of microarray data and extracting reliable information from these data. We review the essential elements of data acquisition, data processing and data analysis, and briefly discuss issues related to the quality, validation and storage of data. Our goal is to point out some of the problems that must be overcome before this promising technology can achieve its full potential.
Affiliation(s)
- K R Hess
- Dept of Biostatistics, University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA.
|
87
|
Abstract
Microarray data analysis can be divided into two tasks: grouping of genes to discover broad patterns of biological behaviour, and filtering of genes to identify specific genes of interest. Whereas the gene-grouping task is largely addressed by cluster analysis, the gene-filtering task relies primarily on hypothesis testing. This review article surveys analytical methods for the gene-filtering task. Various types of data analysis are discussed for four basic types of experimental protocols: a comparison of two biological samples; a comparison of two biological conditions, each represented by a set of replicate samples; a comparison of multiple biological conditions; and analysis of covariate information.
Affiliation(s)
- T D Wu
- Department of Bioinformatics, Genentech, Inc., South San Francisco, CA 94080, USA.
|
88
|
Clark MD, Hennig S, Herwig R, Clifton SW, Marra MA, Lehrach H, Johnson SL. An oligonucleotide fingerprint normalized and expressed sequence tag characterized zebrafish cDNA library. Genome Res 2001; 11:1594-602. [PMID: 11544204 PMCID: PMC311136 DOI: 10.1101/gr.186901] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.7] [Indexed: 01/23/2023]
Abstract
The zebrafish is a powerful system for understanding the vertebrate genome, allowing the combination of genetic, molecular, and embryological analysis. Expressed sequence tags (ESTs) provide a rapid means of identifying an organism's genes for further analysis, but any EST project is limited by the availability of suitable libraries. Such cDNA libraries must be of high quality and provide a high rate of gene discovery. However, commonly used normalization and subtraction procedures tend to select for shorter, truncated, and internally primed inserts, seriously affecting library quality. An alternative procedure is to use oligonucleotide fingerprinting (OFP) to precluster clones before EST sequencing, thereby reducing the re-sequencing of common transcripts. Here, we describe the use of OFP to normalize and subtract 75,000 clones from two cDNA libraries to a minimal set of 25,102 clones. We generated 25,788 ESTs (11,380 3' and 14,408 5') from over 16,000 of these clones. Clustering of 10,654 high-quality 3' ESTs from this set identified 7232 clusters (likely genes), corresponding to a 68% gene diversity rate, comparable to what has been reported for the best normalized human cDNA libraries, and indicating that the complete set of 25,102 clones contains as many as 17,000 genes. Yet, the library quality remains high. The complete set of 25,102 clones is available for researchers as glycerol stocks, filter sets, and individual EST clones. These resources have been used for radiation hybrid, genetic, and physical mapping of the zebrafish genome, as well as positional cloning and candidate gene identification, molecular marker, and microarray development.
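The 68% rate and the estimate of about 17,000 genes can be checked against the counts quoted in the abstract. The following is a minimal sketch of that arithmetic. It assumes that the gene estimate is a simple linear extrapolation of the cluster-to-EST ratio to the full clone set, which the abstract does not state explicitly. The function names are illustrative and do not come from the paper.

```python
def gene_diversity_rate(clusters: int, ests: int) -> float:
    """Fraction of clustered ESTs that represent distinct clusters (likely genes)."""
    return clusters / ests


def projected_gene_count(rate: float, clones: int) -> float:
    """Assumed linear extrapolation of the diversity rate to the whole clone set."""
    return rate * clones


# Counts taken from the abstract.
CLUSTERS = 7232        # clusters found among the high-quality 3' ESTs
ESTS = 10654           # high-quality 3' ESTs that were clustered
TOTAL_CLONES = 25102   # clones in the minimal normalized set

rate = gene_diversity_rate(CLUSTERS, ESTS)
projected = projected_gene_count(rate, TOTAL_CLONES)

print(f"gene diversity rate: {rate:.1%}")
print(f"projected gene count: {projected:.0f}")
```

Under this assumption the ratio rounds to 68% and the projection lands close to the abstract's upper bound of 17,000 genes, which is consistent with the phrase "as many as".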
Affiliation(s)
- M D Clark
- Max-Planck-Institut für Molekulare Genetik, 14195 Berlin, Germany.
|
89
|
Konu Ö, Kane JK, Barrett T, Vawter MP, Chang R, Ma JZ, Donovan DM, Sharp B, Becker KG, Li MD. Region-specific transcriptional response to chronic nicotine in rat brain. Brain Res 2001; 909:194-203. [PMID: 11478936 PMCID: PMC3098570 DOI: 10.1016/s0006-8993(01)02685-3] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.5] [Indexed: 11/22/2022]
Abstract
Even though nicotine has been shown to modulate mRNA expression of a variety of genes, a comprehensive high-throughput study of the effects of nicotine on tissue-specific gene expression profiles has been lacking in the literature. In this study, cDNA microarrays containing 1117 genes and ESTs were used to assess the transcriptional response to chronic nicotine treatment in rat, based on four brain regions, i.e. prefrontal cortex (PFC), nucleus accumbens (NAs), ventral tegmental area (VTA), and amygdala (AMYG). On the basis of a non-parametric resampling method, an index (called the jackknifed reliability index, JRI) was proposed and employed to determine the inherent measurement error across the multiple arrays used in this study. Upon removal of the outliers, the mean correlation coefficient between duplicate measurements increased to 0.978 ± 0.0035 from 0.941 ± 0.045. Results from principal component analysis and pairwise correlations suggested that the brain regions studied were highly similar in terms of their absolute expression levels, but exhibited divergent transcriptional responses to chronic nicotine administration. For example, PFC and NAs were significantly more similar to each other (r = 0.7; P < 10^-14) than to either VTA or AMYG. Furthermore, we confirmed our microarray results for two representative genes, i.e. the weak inward rectifier K+ channel (TWIK-1) and phosphatase and tensin homolog (PTEN), by using the real-time quantitative RT-PCR technique. Finally, a number of genes involved in MAPK, phosphatidylinositol, and EGFR signaling pathways were identified and proposed as possible targets in response to nicotine administration.
Affiliation(s)
- Özlen Konu
- Department of Pharmacology, University of Tennessee College of Medicine, 874 Union Avenue, Memphis, TN 38163, USA
- Justin K. Kane
- Department of Pharmacology, University of Tennessee College of Medicine, 874 Union Avenue, Memphis, TN 38163, USA
- Tanya Barrett
- National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
- Marquis P. Vawter
- National Institute on Drug Abuse, National Institutes of Health, Baltimore, MD 21224, USA
- Ruying Chang
- Department of Pharmacology, University of Tennessee College of Medicine, 874 Union Avenue, Memphis, TN 38163, USA
- Jennie Z. Ma
- Division of Biostatistics, Department of Preventive Medicine, University of Tennessee College of Medicine, Memphis, TN 38163, USA
- David M. Donovan
- National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
- Burt Sharp
- Department of Pharmacology, University of Tennessee College of Medicine, 874 Union Avenue, Memphis, TN 38163, USA
- Kevin G. Becker
- National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
- Ming D. Li
- Department of Pharmacology, University of Tennessee College of Medicine, 874 Union Avenue, Memphis, TN 38163, USA
- Corresponding author. Tel.: +1-901-448-6019; fax: +1-901-448-7206. (M.D. Li)
|
90
|
Abstract
Many new gene products are being discovered by large-scale genomics and proteomics strategies; the challenge is now to develop high-throughput approaches to systematically analyse these proteins and to assign a biological function to them. Having access to these gene products as recombinantly expressed proteins would allow them to be robotically arrayed to generate protein chips. Other applications include using these proteins for the generation of specific antibodies, which can also be arrayed to produce antibody chips. The availability of such protein and antibody arrays would facilitate the simultaneous analysis of thousands of interactions within a single experiment. This chapter will focus on current strategies used to generate protein and antibody arrays and their current applications in biological research, medicine and diagnostics. The shortcomings of these approaches, the developments required, as well as the potential applications of protein and antibody arrays will be discussed.
Affiliation(s)
- D J Cahill
- Max-Planck-Institute of Molecular Genetics, Ihnestrasse 73, D-14195, Berlin, Germany.
|
91
|
Dopazo J, Zanders E, Dragoni I, Amphlett G, Falciani F. Methods and approaches in the analysis of gene expression data. J Immunol Methods 2001; 250:93-112. [PMID: 11251224 DOI: 10.1016/s0022-1759(01)00307-6] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.5] [Indexed: 01/04/2023]
Abstract
The application of high-density DNA array technology to monitor gene transcription has been responsible for a real paradigm shift in biology. The majority of research groups now have the ability to measure the expression of a significant proportion of the human genome in a single experiment, resulting in an unprecedented volume of data being made available to the scientific community. As a consequence, the storage, analysis and interpretation of this information present a major challenge. In the field of immunology the analysis of gene expression profiles has opened new areas of investigation. The study of cellular responses has revealed that cells respond to an activation signal with waves of co-ordinated gene expression profiles and that the components of these responses are the key to understanding the specific mechanisms which lead to phenotypic differentiation. The discovery of 'cell type specific' gene expression signatures has also helped the interpretation of the mechanisms leading to disease progression. Here we review the principles behind the most commonly used data analysis methods and discuss the approaches that have been employed in immunological research.
Affiliation(s)
- J Dopazo
- Bioinformatica, Centro Nacional de Investigaciones Oncológicas Carlos III, Ctra. Majadahonda-Pozuelo, Km. 2 Majadahonda 28220, Madrid, Spain
|
92
|
|