1. Li Z, Song C, Yang J, Jia Z, Chen D, Yan C, Tian L, Wu X. Clustering algorithm based on DINNSM and its application in gene expression data analysis. Technol Health Care 2024; 32:229-239. [PMID: 38759052 PMCID: PMC11191479 DOI: 10.3233/thc-248020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2024]
Abstract
BACKGROUND: Selecting an appropriate similarity measurement method is crucial for obtaining biologically meaningful clustering modules. Commonly used measurement methods are insufficient in capturing the complexity of biological systems and fail to accurately represent their intricate interactions. OBJECTIVE: This study aimed to obtain biologically meaningful gene modules by using a clustering algorithm based on a similarity measurement method. METHODS: A new algorithm called the Dual-Index Nearest Neighbor Similarity Measure (DINNSM) was proposed. The algorithm first calculated a similarity matrix between genes using Pearson's or Spearman's correlation, and then constructed a nearest-neighbor table from that matrix. The final similarity matrix was reconstructed using the positions of shared genes in the nearest-neighbor table and the number of shared genes. RESULTS: Experiments were conducted on five gene expression datasets, and DINNSM was compared with five widely used similarity measures for gene expression data. The findings demonstrate that clustering with DINNSM as the similarity measure performed better than clustering with the alternative measures. CONCLUSIONS: DINNSM provided more accurate insights into the intricate biological connections among genes, facilitating the identification of more accurate and biologically meaningful gene co-expression modules.
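The three steps named in the abstract (correlation matrix, nearest-neighbor table, similarity rebuilt from shared neighbors and their positions) can be illustrated with a minimal sketch. The abstract does not give the DINNSM formula, so the rank-weighted score below is an assumed stand-in, and the function names `neighbor_table` and `shared_neighbor_similarity` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def neighbor_table(expr, k):
    """For each gene, list the indices of its k most correlated other genes."""
    n = len(expr)
    table = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: pearson(expr[i], expr[j]), reverse=True)
        table.append(others[:k])
    return table


def shared_neighbor_similarity(expr, k):
    """Rebuild a similarity matrix from shared neighbors and their ranks.

    Assumed scoring (not taken from the paper): each shared neighbor adds
    (k - rank_i) * (k - rank_j), normalised so every score lies in [0, 1].
    """
    table = neighbor_table(expr, k)
    n = len(expr)
    max_score = sum((k - r) ** 2 for r in range(k))
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        rank_i = {g: r for r, g in enumerate(table[i])}
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            score = sum((k - rank_i[g]) * (k - r)
                        for r, g in enumerate(table[j]) if g in rank_i)
            sim[i][j] = score / max_score
    return sim


# Toy expression profiles: genes 0-2 rise together, genes 3-5 fall together.
expr = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],
    [1, 3, 2, 5, 4],
    [5, 4, 3, 2, 1],
    [10, 8, 6, 4, 1],
    [5, 3, 4, 1, 2],
]
sim = shared_neighbor_similarity(expr, k=2)
```

In this toy input, genes from the same group share neighbors and receive a positive score, while genes from opposite groups share none and score zero. The resulting matrix could then be handed to any clustering routine that accepts a precomputed similarity.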
Affiliation(s)
- Zongjin Li: Department of Computer, Qinghai Normal University, Xining, China
- Changxin Song: Department of Mechanical Engineering and Information, Shanghai Urban Construction Vocational College, Shanghai, China
- Jiyu Yang: Department of Cardiovascular Medicine, Xining First People’s Hospital, Xining, China
- Zeyu Jia: Department of Computer, Qinghai Normal University, Xining, China
- Dongzhen Chen: School of Materials Science and Engineering, Xi’an Polytechnic University, Xi’an, China
- Chengying Yan: Department of Cardiovascular Medicine, Xining First People’s Hospital, Xining, China
- Liqin Tian: Department of Computer, Qinghai Normal University, Xining, China; School of Computer, North China Institute of Science and Technology, Langfang, China
- Xiaoming Wu: The Key Laboratory of Biomedical Information Engineering of Ministry of Education, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
2. Singh V, Verma NK. Gene Expression Data Analysis Using Feature Weighted Robust Fuzzy c-Means Clustering. IEEE Trans Nanobioscience 2022; PP:99-105. [PMID: 35259111 DOI: 10.1109/tnb.2022.3157396] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
Clustering of gene expression data has been proven to be very useful in various applications, e.g., identifying the natural structure inherent in gene expression, understanding gene functions, mining relevant information from noisy data, and understanding gene regulation. In all these applications, genes, i.e., features, play a crucial role in characterizing the data into different groups. These features may be relevant, irrelevant, or redundant, and they contribute differently to the clustering process. This paper presents a novel approach that considers the effect of features during the clustering process. In the proposed method, the fuzzy c-means objective function is modified using a weighted Euclidean distance between the features with a monotonically decreasing function. The monotonically decreasing function helps control the features' contribution during the clustering process to partition the data into more relevant clusters. The proposed approach is validated, and its performance is reported using various clustering performance measures on different standard datasets. These clustering performance measures have also been compared with multiple state-of-the-art methods.
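A minimal sketch of the idea described above: feature weights produced by a monotonically decreasing function, plugged into the standard fuzzy c-means membership update through a weighted Euclidean distance. The abstract does not say what the decreasing function is or what it is applied to, so `exp(-beta * dispersion)` and the helper names are assumptions made purely for illustration.

```python
from math import exp


def feature_weights(dispersions, beta=1.0):
    """Turn per-feature dispersion into weights that sum to 1.

    exp(-beta * d) is an assumed choice; the abstract only states that
    the function is monotonically decreasing.
    """
    raw = [exp(-beta * d) for d in dispersions]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_sq_dist(x, c, w):
    """Weighted squared Euclidean distance between a point and a center."""
    return sum(wf * (xf - cf) ** 2 for xf, cf, wf in zip(x, c, w))


def memberships(points, centers, w, m=2.0):
    """Standard fuzzy c-means membership update with the weighted distance.

    u_ik = D_ik^(-1/(m-1)) / sum_j D_jk^(-1/(m-1)), where D is the squared
    weighted distance and m > 1 is the fuzzifier.
    """
    result = []
    for x in points:
        d = [weighted_sq_dist(x, c, w) for c in centers]
        if any(v == 0.0 for v in d):
            # The point coincides with a center: give it full membership there.
            hits = [1.0 if v == 0.0 else 0.0 for v in d]
            s = sum(hits)
            result.append([h / s for h in hits])
            continue
        inv = [v ** (-1.0 / (m - 1.0)) for v in d]
        total = sum(inv)
        result.append([v / total for v in inv])
    return result


w = feature_weights([0.1, 2.0])          # the noisier second feature gets less weight
points = [[0.0, 0.0], [1.0, 10.0]]
centers = [[0.0, 0.5], [1.0, 9.5]]
u = memberships(points, centers, w)
```

Down-weighting a noisy feature shrinks its share of every distance, so the partition is driven mostly by the features judged more relevant.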
3. Application of Systems Engineering Principles and Techniques in Biological Big Data Analytics: A Review. Processes (Basel) 2020. [DOI: 10.3390/pr8080951] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022]
Abstract
In the past few decades, we have witnessed tremendous advancements in biology, life sciences and healthcare. These advancements are due in no small part to the big data made available by various high-throughput technologies, ever-advancing computing power, and algorithmic advancements in machine learning. Specifically, big data analytics such as statistical and machine learning has become an essential tool in these rapidly developing fields. As a result, the subject has drawn increased attention, and many review papers on it have been published in just the past few years. Different from all existing reviews, this work focuses on the application of systems engineering principles and techniques in addressing some of the common challenges in big data analytics for biological, biomedical and healthcare applications. Specifically, this review focuses on the following three key areas in biological big data analytics where systems engineering principles and techniques have been playing important roles: the principle of parsimony in addressing overfitting, the dynamic analysis of biological data, and the role of domain knowledge in biological data analytics.
4. Aggregation of multi-objective fuzzy symmetry-based clustering techniques for improving gene and cancer classification. Soft comput 2018. [DOI: 10.1007/s00500-017-2865-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022]
5. Parallel swarm intelligence strategies for large-scale clustering based on MapReduce with application to epigenetics of aging. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.04.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/31/2022]
6. Gao Q, Ostendorf E, Cruz JA, Jin R, Kramer DM, Chen J. Inter-functional analysis of high-throughput phenotype data by non-parametric clustering and its application to photosynthesis. Bioinformatics 2015; 32:67-76. [PMID: 26342101 DOI: 10.1093/bioinformatics/btv515] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/12/2015] [Accepted: 08/25/2015] [Indexed: 01/20/2023]
Abstract
MOTIVATION Phenomics is the study of the properties and behaviors of organisms (i.e. their phenotypes) on a high-throughput scale. New computational tools are needed to analyze complex phenomics data, which consists of multiple traits/behaviors that interact with each other and are dependent on external factors, such as genotype and environmental conditions, in a way that has not been well studied. RESULTS We deployed an efficient framework for partitioning complex and high dimensional phenotype data into distinct functional groups. To achieve this, we represented measured phenotype data from each genotype as a cloud-of-points, and developed a novel non-parametric clustering algorithm to cluster all the genotypes. When compared with conventional clustering approaches, the new method is advantageous in that it makes no assumption about the parametric form of the underlying data distribution and is thus particularly suitable for phenotype data analysis. We demonstrated the utility of the new clustering technique by distinguishing novel phenotypic patterns in both synthetic data and a high-throughput plant photosynthetic phenotype dataset. We biologically verified the clustering results using four Arabidopsis chloroplast mutant lines. AVAILABILITY AND IMPLEMENTATION Software is available at www.msu.edu/~jinchen/NPM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online. CONTACT jinchen@msu.edu, kramerd8@cns.msu.edu or rongjin@cse.msu.edu.
Affiliation(s)
- Qiaozi Gao: Department of Computer Science and Engineering
- Rong Jin: Department of Computer Science and Engineering
- David M Kramer: Department of Energy Plant Research Laboratory and Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Jin Chen: Department of Computer Science and Engineering, Department of Energy Plant Research Laboratory and
7.
8. Semi-supervised clustering for gene-expression data in multiobjective optimization framework. INT J MACH LEARN CYB 2015. [DOI: 10.1007/s13042-015-0335-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 10/24/2022]
9. Wang X, Liu A. Expression of Genes Controlling Unsaturated Fatty Acids Biosynthesis and Oil Deposition in Developing Seeds of Sacha Inchi (Plukenetia volubilis L.). Lipids 2014; 49:1019-31. [DOI: 10.1007/s11745-014-3938-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 04/28/2014] [Accepted: 07/17/2014] [Indexed: 02/03/2023]
10. Doostparast Torshizi A, Fazel Zarandi MH. Alpha-plane based automatic general type-2 fuzzy clustering based on simulated annealing meta-heuristic algorithm for analyzing gene expression data. Comput Biol Med 2014; 64:347-59. [PMID: 25035233 DOI: 10.1016/j.compbiomed.2014.06.017] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 02/15/2014] [Revised: 06/17/2014] [Accepted: 06/21/2014] [Indexed: 10/25/2022]
Abstract
This paper considers microarray gene expression data clustering using a novel two stage meta-heuristic algorithm based on the concept of α-planes in general type-2 fuzzy sets. The main aim of this research is to present a powerful data clustering approach capable of dealing with highly uncertain environments. In this regard, first, a new objective function using α-planes for general type-2 fuzzy c-means clustering algorithm is represented. Then, based on the philosophy of the meta-heuristic optimization framework 'Simulated Annealing', a two stage optimization algorithm is proposed. The first stage of the proposed approach is devoted to the annealing process accompanied by its proposed perturbation mechanisms. After termination of the first stage, its output is inserted to the second stage where it is checked with other possible local optima through a heuristic algorithm. The output of this stage is then re-entered to the first stage until no better solution is obtained. The proposed approach has been evaluated using several synthesized datasets and three microarray gene expression datasets. Extensive experiments demonstrate the capabilities of the proposed approach compared with some of the state-of-the-art techniques in the literature.
11. Doostparast Torshizi A, Fazel Zarandi MH. A new cluster validity measure based on general type-2 fuzzy sets: Application in gene expression data clustering. Knowl Based Syst 2014. [DOI: 10.1016/j.knosys.2014.03.023] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 10/25/2022]
12. Big data analysis using modern statistical and machine learning methods in medicine. Int Neurourol J 2014; 18:50-7. [PMID: 24987556 PMCID: PMC4076480 DOI: 10.5213/inj.2014.18.2.50] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Received: 06/17/2014] [Accepted: 06/20/2014] [Indexed: 11/08/2022]
Abstract
In this article we introduce modern statistical machine learning and bioinformatics approaches that have been used in learning statistical relationships from big data in medicine and behavioral science that typically include clinical, genomic (and proteomic) and environmental variables. Every year, data collected from biomedical and behavioral science is getting larger and more complicated. Thus, in medicine, we also need to be aware of this trend and understand the statistical tools that are available to analyze these datasets. Many statistical analyses aimed at such big datasets have been introduced recently. However, given the many different types of clinical, genomic, and environmental data, it is rather uncommon to see statistical methods that combine knowledge resulting from those different data types. To this end, we will introduce big data in terms of clinical data, single nucleotide polymorphism and gene expression studies and their interactions with environment. In this article, we will introduce the concept of well-known regression analyses such as linear and logistic regressions that have been widely used in clinical data analyses, and modern statistical models such as Bayesian networks that have been introduced to analyze more complicated data. We will also discuss how to represent the interaction among clinical, genomic, and environmental data using modern statistical models. We conclude this article with a promising modern statistical method called Bayesian networks that is suitable for analyzing big datasets that consist of different types of large clinical, genomic, and environmental data. Such statistical models built from big data will provide us with a more comprehensive understanding of human physiology and disease.
13. Gene expression data clustering using a multiobjective symmetry based clustering technique. Comput Biol Med 2013; 43:1965-77. [DOI: 10.1016/j.compbiomed.2013.07.021] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 04/03/2013] [Revised: 07/09/2013] [Accepted: 07/17/2013] [Indexed: 11/21/2022]
14.
Abstract
Applications of clustering algorithms in biomedical research are ubiquitous, with typical examples including gene expression data analysis, genomic sequence analysis, biomedical document mining, and MRI image analysis. However, due to the diversity of cluster analysis, the differing terminologies, goals, and assumptions underlying different clustering algorithms can be daunting. Thus, determining the right match between clustering algorithms and biomedical applications has become particularly important. This paper is presented to provide biomedical researchers with an overview of the status quo of clustering algorithms, to illustrate examples of biomedical applications based on cluster analysis, and to help biomedical researchers select the most suitable clustering algorithms for their own applications.
Affiliation(s)
- Rui Xu: Industrial Artificial Intelligence Laboratory, GE Global Research Center, Niskayuna, NY 12309, USA.
15. Zhang Q, Fan X, Wang Y, Sun M, Sun SSM, Guo D. A model-based method for gene dependency measurement. PLoS One 2012; 7:e40918. [PMID: 22829898 PMCID: PMC3400631 DOI: 10.1371/journal.pone.0040918] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/18/2011] [Accepted: 06/19/2012] [Indexed: 02/02/2023]
Abstract
Many computational methods have been widely used to identify transcription regulatory interactions based on gene expression profiles. The selection of a dependency measure is very important for successful regulatory network inference. In this paper, we develop a new method, DBoMM (Difference in BIC of Mixture Models), for estimating gene dependency by fitting gene expression profiles to Gaussian mixture models. We show that DBoMM outperforms 4 other existing methods, including Kendall's tau correlation (TAU), Pearson Correlation (COR), Euclidean distance (EUC) and Mutual information (MI), using Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Arabidopsis thaliana data and synthetic data. DBoMM can also identify condition-dependent regulatory interactions and is robust to noisy data. Of the 741 Escherichia coli regulatory interactions inferred by DBoMM at a 60% true positive rate, 65 are previously known interactions and 676 are novel predictions. To validate the new predictions, the promoter sequences of target genes regulated by the same transcription factors were analyzed and significant motifs were identified.
Affiliation(s)
- Qing Zhang: School of Life Sciences and the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Xiaodan Fan: Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Yejun Wang: School of Life Sciences and the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Mingan Sun: School of Life Sciences and the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Samuel S. M. Sun: School of Life Sciences and the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Dianjing Guo: School of Life Sciences and the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
16. Aittokallio T, Kurki M, Nevalainen O, Nikula T, West A, Lahesmaa R. Computational Strategies for Analyzing Data in Gene Expression Microarray Experiments. J Bioinform Comput Biol 2012; 1:541-86. [PMID: 15290769 DOI: 10.1142/s0219720003000319] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Received: 02/13/2003] [Revised: 07/02/2003] [Indexed: 11/18/2022]
Abstract
Microarray analysis has become a widely used method for generating gene expression data on a genomic scale. Microarrays have been enthusiastically applied in many fields of biological research, even though several open questions remain about the analysis of such data. A wide range of approaches are available for computational analysis, but no general consensus exists as to a standard protocol for microarray data analysis. Consequently, the choice of data analysis technique is a crucial element depending both on the data and on the goals of the experiment. Therefore, a basic understanding of bioinformatics is required for optimal experimental design and meaningful interpretation of the results. This review summarizes some of the common themes in DNA microarray data analysis, including data normalization and detection of differential expression. Algorithms are demonstrated by analyzing cDNA microarray data from an experiment monitoring gene expression in T helper cells. Several computational biology strategies, along with their relative merits, are overviewed, and potential areas for additional research are discussed. The goal of the review is to provide a computational framework for applying and evaluating such bioinformatics strategies. Solid knowledge of microarray informatics contributes to the implementation of more efficient computational protocols for the given data obtained through microarray experiments.
Affiliation(s)
- Tero Aittokallio: Department of Computational Biology, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-Shi, Chiba 277-8562, Japan.
17. ROMDHANE LOTFI BEN, SHILI HECHMI, AYEB BECHIR. P3M: POSSIBILISTIC MULTI-STEP MAXMIN AND MERGING ALGORITHM WITH APPLICATION TO GENE EXPRESSION DATA MINING. INT J ARTIF INTELL T 2011. [DOI: 10.1142/s0218213009000263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene expression data generated by DNA microarray experiments provide a vast resource for medical diagnosis and disease understanding. Unfortunately, the large amount of data makes it hard, sometimes impossible, to understand the correct behavior of genes. In this work, we develop a possibilistic approach for mining gene microarray data. Our model consists of two steps. In the first step, we use possibilistic clustering to partition the data into groups (or clusters). The optimal number of clusters is evaluated automatically from data using the Partition Information Entropy as a validity measure. In the second step, we select from each computed cluster the most representative genes and model them as a graph called a proximity graph. This set of graphs (or hyper-graph) will be used to predict the function of new and previously unknown genes. Benchmark results on real-world data sets reveal a good performance of our model in computing optimal partitions even in the presence of noise, and a high prediction accuracy on unknown genes.
Affiliation(s)
- LOTFI BEN ROMDHANE: PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
- HECHMI SHILI: PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
- BECHIR AYEB: PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
18. Guan X, Lee JJ, Pang M, Shi X, Stelly DM, Chen ZJ. Activation of Arabidopsis seed hair development by cotton fiber-related genes. PLoS One 2011; 6:e21301. [PMID: 21779324 PMCID: PMC3136922 DOI: 10.1371/journal.pone.0021301] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 02/23/2011] [Accepted: 05/25/2011] [Indexed: 02/07/2023]
Abstract
Each cotton fiber is a single-celled seed trichome or hair, and over 20,000 fibers may develop semi-synchronously on each seed. The molecular basis for seed hair development is unknown but is likely to share many similarities with leaf trichome development in Arabidopsis. Leaf trichome initiation in Arabidopsis thaliana is activated by GLABROUS1 (GL1), which is negatively regulated by TRIPTYCHON (TRY). Using laser capture microdissection and microarray analysis, we found that many putative MYB transcription factor and structural protein genes were differentially expressed in fiber and non-fiber tissues. Gossypium hirsutum MYB2 (GhMYB2), a putative GL1 homolog, and its downstream gene, GhRDL1, were highly expressed during fiber cell initiation. GhRDL1, a fiber-related gene with unknown function, was predominantly localized around cell walls in stems, sepals, seed coats, and pollen grains. GFP:GhRDL1 and GhMYB2:YFP were co-localized in the nuclei of ectopic trichomes in siliques. Overexpressing GhRDL1 or GhMYB2 in A. thaliana Columbia-0 (Col-0) activated fiber-like hair production in 4–6% of seeds and had no obvious effects on trichome development in leaves or siliques. Co-overexpressing GhRDL1 and GhMYB2 in A. thaliana Col-0 plants increased hair formation in ∼8% of seeds. Overexpressing both GhRDL1 and GhMYB2 in A. thaliana Col-0 try mutant plants produced seed hair in ∼10% of seeds as well as dense trichomes inside and outside siliques, suggesting synergistic effects of GhRDL1 and GhMYB2 with try on development of trichomes inside and outside of siliques and seed hair in A. thaliana. These data suggest that a different combination of factors is required for the full development of trichomes (hairs) in leaves, siliques, and seeds. A. thaliana can be developed as a model system for discovering additional genes that control seed hair development in general and cotton fiber in particular.
Affiliation(s)
- Xueying Guan: Section of Molecular Cell and Developmental Biology and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Jinsuk J. Lee: Section of Molecular Cell and Developmental Biology and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Mingxiong Pang: Section of Molecular Cell and Developmental Biology and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Xiaoli Shi: Section of Molecular Cell and Developmental Biology and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- David M. Stelly: Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Z. Jeffrey Chen: Section of Molecular Cell and Developmental Biology and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
19. Wiltgen M, Tilz GP. Molecular diagnosis and prognosis with DNA microarrays. Hematology 2011; 16:166-76. [PMID: 21669057 DOI: 10.1179/102453311x12953015767257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2022]
Abstract
Microarray analysis makes it possible to determine thousands of gene expression values simultaneously. Changes in gene expression, as a response to diseases, can be detected allowing a better understanding and differentiation of diseases at a molecular level. By comparing different kinds of tissue, for example healthy tissue and cancer tissue, the microarray analysis indicates induced gene activity, repressed gene activity or when there is no change in the gene activity level. Fundamental patterns in gene expression are extracted by several clustering and machine learning algorithms. Certain kinds of cancer can be divided into subtypes, with different clinical outcomes, by their specific gene expression patterns. This enables a better diagnosis and tailoring of individual patient treatments.
Affiliation(s)
- Marco Wiltgen: Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.
20. Peter W, Najmi AH, Burkom HS. Reducing false alarms in syndromic surveillance. Stat Med 2011; 30:1665-77. [PMID: 21432890 DOI: 10.1002/sim.4204] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 10/14/2008] [Accepted: 12/09/2010] [Indexed: 11/11/2022]
Abstract
Algorithms for identifying public health threats or disease outbreaks are vulnerable to false alarms arising from sudden shifts in health-care utilization or data participation. This paper describes a method of reducing false alerts in automated public health surveillance algorithms, and in particular, automated syndromic surveillance algorithms, that rely on health-care utilization data. The technique is based on monitoring syndromic counts with reference to a suitable background, or reference, series of counts. The suitability of the background time series in decreasing the false-alarm rate will be shown to be related mathematically to the so-called mutual information that exists between the random variables representing the syndromic and background time series of counts. The method can be understood as a noise cancellation filter technique in which one noisy (reference) channel is used to cancel the background noise of the monitored (measured) channel. The issues discussed here may also be relevant to the appropriate use of rates in epidemiology and biostatistics.
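The quantity the abstract ties to background suitability, the mutual information between the syndromic and background count series, can be estimated empirically. The sketch below is a plain plug-in estimate over binned counts, not the paper's derivation; the fixed-width binning and the names `bin_counts` and `mutual_information` are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def bin_counts(series, width):
    """Coarsen raw daily counts into integer bins of the given width (assumed scheme)."""
    return [c // width for c in series]


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) between two discrete series."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy series: the first background tracks the syndromic counts, the second does not.
syndromic = bin_counts([12, 14, 31, 33, 11, 35, 13, 30], width=10)
tracking = bin_counts([52, 55, 91, 95, 50, 99, 58, 93], width=10)
unrelated = bin_counts([52, 91, 55, 95, 93, 50, 99, 58], width=10)

mi_tracking = mutual_information(syndromic, tracking)
mi_unrelated = mutual_information(syndromic, unrelated)
```

Under the abstract's claim, a background series that shares more information with the monitored series would be the better reference for cancelling utilization-driven noise; with short series, plug-in estimates like this one are biased upward, so they are best used for comparing candidates rather than as absolute values.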
Affiliation(s)
- William Peter: Applied Physics Laboratory, Johns Hopkins University, 11100 Johns Hopkins Road, Laurel, MD 20723, USA.
21. Xue Y, Liu Z, Cao J, Ma Q, Gao X, Wang Q, Jin C, Zhou Y, Wen L, Ren J. GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. Protein Eng Des Sel 2010; 24:255-60. [PMID: 21062758 DOI: 10.1093/protein/gzq094] [Citation(s) in RCA: 197] [Impact Index Per Article: 14.1] [Indexed: 02/07/2023]
Abstract
As the most important post-translational modification of proteins, phosphorylation plays essential roles in all aspects of biological processes. Besides experimental approaches, computational prediction of phosphorylated proteins with their kinase-specific phosphorylation sites has also emerged as a popular strategy because of its low cost, speed and convenience. In this work, we developed GPS 2.1 (Group-based Prediction System), a predictor of kinase-specific phosphorylation sites, with a novel but simple approach of motif length selection (MLS). With this approach, the robustness of the prediction system was greatly improved. All algorithms from earlier GPS versions were also retained and integrated into GPS 2.1. The online service and local packages of GPS 2.1 were implemented in JAVA 1.5 (J2SE 5.0) and are freely available for academic research at: http://gps.biocuckoo.org.
Affiliation(s)
- Yu Xue: Hubei Bioinformatics and Molecular Imaging Key Laboratory, Department of Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
22. Broad conservation of milk utilization genes in Bifidobacterium longum subsp. infantis as revealed by comparative genomic hybridization. Appl Environ Microbiol 2010; 76:7373-81. [PMID: 20802066 DOI: 10.1128/aem.00675-10] [Citation(s) in RCA: 165] [Impact Index Per Article: 11.8] [Indexed: 02/06/2023]
Abstract
Human milk oligosaccharides (HMOs) are the third-largest solid component of milk. Their structural complexity renders them nondigestible to the host but liable to hydrolytic enzymes of the infant colonic microbiota. Bifidobacteria and, frequently, Bifidobacterium longum strains predominate the colonic microbiota of exclusively breast-fed infants. Among the three recognized subspecies of B. longum, B. longum subsp. infantis achieves high levels of cell growth on HMOs and is associated with early colonization of the infant gut. The B. longum subsp. infantis ATCC 15697 genome features five distinct gene clusters with the predicted capacity to bind, cleave, and import milk oligosaccharides. Comparative genomic hybridizations (CGHs) were used to associate genotypic biomarkers among 15 B. longum strains exhibiting various HMO utilization phenotypes and host associations. Multilocus sequence typing provided taxonomic subspecies designations and grouped the strains between B. longum subsp. infantis and B. longum subsp. longum. CGH analysis determined that HMO utilization gene regions are exclusively conserved across all B. longum subsp. infantis strains capable of growth on HMOs and have diverged in B. longum subsp. longum strains that cannot grow on HMOs. These regions contain fucosidases, sialidases, glycosyl hydrolases, ABC transporters, and family 1 solute binding proteins and are likely needed for efficient metabolism of HMOs. Urea metabolism genes and their activity were exclusively conserved in B. longum subsp. infantis. These results imply that the B. longum has at least two distinct subspecies: B. longum subsp. infantis, specialized to utilize milk carbon, and B. longum subsp. longum, specialized for plant-derived carbon metabolism.
|
23
|
Xue Y, Liu Z, Gao X, Jin C, Wen L, Yao X, Ren J. GPS-SNO: computational prediction of protein S-nitrosylation sites with a modified GPS algorithm. PLoS One 2010; 5:e11290. [PMID: 20585580 PMCID: PMC2892008 DOI: 10.1371/journal.pone.0011290] [Citation(s) in RCA: 177] [Impact Index Per Article: 12.6] [Received: 11/24/2009] [Accepted: 06/04/2010] [Indexed: 11/18/2022] Open
Abstract
As one of the most important and ubiquitous post-translational modifications (PTMs) of proteins, S-nitrosylation plays important roles in a variety of biological processes, including the regulation of cellular dynamics and plasticity. Identification of S-nitrosylated substrates with their exact sites is crucial for understanding the molecular mechanisms of S-nitrosylation. In contrast with labor-intensive and time-consuming experimental approaches, prediction of S-nitrosylation sites using computational methods could provide convenience and increased speed. In this work, we developed novel software, GPS-SNO 1.0, for the prediction of S-nitrosylation sites. We greatly improved our previously developed algorithm and released the GPS 3.0 algorithm for GPS-SNO. By comparison, the prediction performance of the GPS 3.0 algorithm was better than that of other methods, with an accuracy of 75.80%, a sensitivity of 53.57% and a specificity of 80.14%. As an application of GPS-SNO 1.0, we predicted putative S-nitrosylation sites for hundreds of potentially S-nitrosylated substrates for which the exact S-nitrosylation sites had not been experimentally determined. In this regard, GPS-SNO 1.0 should prove to be a useful tool for experimentalists. The online service and local packages of GPS-SNO were implemented in Java and are freely available at: http://sno.biocuckoo.org/.
Affiliation(s)
- Yu Xue
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, Department of Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Zexian Liu
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China, Hefei, China
- Xinjiao Gao
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China, Hefei, China
- Changjiang Jin
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China, Hefei, China
- Longping Wen
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China, Hefei, China
- Xuebiao Yao
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China, Hefei, China
- Jian Ren
- Life Sciences School, Sun Yat-sen University (SYSU), Guangzhou, Guangdong, China
|
24
|
Assessing functional annotation transfers with inter-species conserved coexpression: application to Plasmodium falciparum. BMC Genomics 2010; 11:35. [PMID: 20078859 PMCID: PMC2826313 DOI: 10.1186/1471-2164-11-35] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 07/10/2009] [Accepted: 01/15/2010] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Plasmodium falciparum is the main causative agent of malaria. Of the 5,484 predicted genes of P. falciparum, about 57% do not have sufficient sequence similarity to characterized genes in other species to warrant functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. Gene expression data have been widely used in recent years to help functional annotation in an intra-species way via the so-called Guilt By Association (GBA) principle. RESULTS We propose a new method that uses gene expression data to assess inter-species annotation transfers. Our approach starts from a set of likely orthologs between a reference species (here S. cerevisiae and D. melanogaster) and a query species (P. falciparum). It aims at identifying clusters of coexpressed genes in the query species whose coexpression has been conserved in the reference species. These conserved clusters of coexpressed genes are then used to assess annotation transfers between genes with low sequence similarity, enabling reliable transfers of annotations from the reference to the query species. The approach was used with transcriptomic data sets of P. falciparum, S. cerevisiae and D. melanogaster, and enabled us to propose with high confidence new/refined annotations for several dozen hypothetical/putative P. falciparum genes. Notably, we revised the annotation of genes involved in ribosomal proteins and ribosome biogenesis and assembly, thus highlighting several potential drug targets. CONCLUSIONS Our approach uses both sequence similarity and gene expression data to help inter-species gene annotation transfers. Experiments show that this strategy improves the accuracy achieved when using solely sequence similarity and outperforms the accuracy of the GBA approach. In addition, our experiments with P. falciparum show that it can infer a function for numerous hypothetical genes.
|
25
|
Wan R, Kiseleva L, Harada H, Mamitsuka H, Horton P. HAMSTER: visualizing microarray experiments as a set of minimum spanning trees. Source Code for Biology and Medicine 2009; 4:8. [PMID: 19925686 PMCID: PMC2784758 DOI: 10.1186/1751-0473-4-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2009] [Accepted: 11/20/2009] [Indexed: 11/17/2022]
Abstract
Background Visualization tools allow researchers to obtain a global view of the interrelationships between the probes or experiments of a gene expression (e.g. microarray) data set. Some existing methods include hierarchical clustering and k-means. In recent years, others have proposed applying minimum spanning trees (MST) for microarray clustering. Although MST-based clustering is formally equivalent to the dendrograms produced by hierarchical clustering under certain conditions, visually they can be quite different. Methods HAMSTER (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) is an open source system for generating a set of MSTs from the experiments of a microarray data set. While previous works have generated a single MST from a data set for data clustering, we recursively merge experiments and repeat this process to obtain a set of MSTs for data visualization. Depending on the parameters chosen, each tree is analogous to a snapshot of one step of the hierarchical clustering process. We scored and ranked these trees using one of three proposed schemes. HAMSTER is implemented in C++ and makes use of Graphviz for laying out each MST. Results We report on the running time of HAMSTER and demonstrate using data sets from the NCBI Gene Expression Omnibus (GEO) that the images created by HAMSTER offer insights that differ from the dendrograms of hierarchical clustering. In addition to the C++ program, which is available as open source, we also provided a web-based version (HAMSTER+) which allows users to apply our system through a web browser without any computer programming knowledge. Conclusion Researchers may find it helpful to include HAMSTER in their microarray analysis workflow as it can offer insights that differ from hierarchical clustering. We believe that HAMSTER would be useful for certain types of gradient data sets (e.g. time-series data) and data that indicate relationships between cells/tissues.
Both the source and the web server variant of HAMSTER are available from http://hamster.cbrc.jp/.
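The idea of building a minimum spanning tree over the experiments of an expression data set can be sketched as follows. This is only an illustrative sketch, not the HAMSTER implementation: the distance (1 minus Pearson correlation), the use of Prim's algorithm, the function names, and the toy `profiles` data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def minimum_spanning_tree(profiles):
    """Prim's algorithm on the complete graph whose edge weights are
    1 - Pearson correlation (an assumed distance, chosen for illustration).
    Returns a list of (i, j, weight) edges."""
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges

# Hypothetical toy data: profiles 0 and 1 rise together, 2 and 3 fall together.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 4.0, 2.0],
]
mst_edges = minimum_spanning_tree(profiles)
```

In this toy case the tree contains two very short edges (one inside each co-varying pair) and a single long edge joining the two pairs, which is the kind of structure a tree layout would make visible.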
Affiliation(s)
- Raymond Wan
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, 611-0011, Japan.
|
26
|
Rasche A, Herwig R. ARH: predicting splice variants from genome-wide data with modified entropy. Bioinformatics 2009; 26:84-90. [PMID: 19889797 DOI: 10.1093/bioinformatics/btp626] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION Exon arrays allow the quantitative study of alternative splicing (AS) on a genome-wide scale. A variety of splicing prediction methods has been proposed for Affymetrix exon arrays mainly focusing on geometric correlation measures or analysis of variance. In this article, we introduce an information theoretic concept that is based on modification of the well-known entropy function. RESULTS We have developed an AS robust prediction method based on entropy (ARH). We can show that this measure copes with bias inherent in the analysis of AS such as the dependency of prediction performance on the number of exons or variable exon expression. In order to judge the performance of ARH, we have compared it with eight existing splicing prediction methods using experimental benchmark data and demonstrate that ARH is a well-performing new method for the prediction of splice variants. AVAILABILITY AND IMPLEMENTATION ARH is implemented in R and provided in the Supplementary Material.
Affiliation(s)
- Axel Rasche
- Department of Vertebrate Genomics, Max-Planck-Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany.
|
27
|
Exploring ant-based algorithms for gene expression data analysis. Artif Intell Med 2009; 47:105-19. [DOI: 10.1016/j.artmed.2009.03.004] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 03/26/2008] [Revised: 03/17/2009] [Accepted: 03/21/2009] [Indexed: 11/23/2022]
|
28
|
Salicrú M, Vives S, Zheng T. Inferential clustering approach for microarray experiments with replicated measurements. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009; 6:594-604. [PMID: 19875858 DOI: 10.1109/tcbb.2008.106] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/28/2023]
Abstract
Cluster analysis has proven to be a useful tool for investigating the association structure among genes in a microarray data set. There is a rich literature on cluster analysis and various techniques have been developed. Such analyses heavily depend on an appropriate (dis)similarity measure. In this paper, we introduce a general clustering approach based on the confidence interval inferential methodology, which is applied to gene expression data of microarray experiments. Emphasis is placed on data with low replication (three or five replicates). The proposed method makes more efficient use of the measured data and avoids the subjective choice of a dissimilarity measure. This new methodology, when applied to real data, provides an easy-to-use bioinformatics solution for the cluster analysis of microarray experiments with replicates (see the Appendix). Even though the method is presented under the framework of microarray experiments, it is a general algorithm that can be used to identify clusters in any situation. The method's performance is evaluated using simulated and publicly available data sets. Our results also clearly show that our method is not an extension of the conventional clustering method based on correlation or Euclidean distance.
Affiliation(s)
- Miquel Salicrú
- Statistics Department, Barcelona University, Avda Diagonal 645, 08028 BCN, Spain.
|
29
|
Maulik U, Mukhopadhyay A, Bandyopadhyay S. Finding multiple coherent biclusters in microarray data using variable string length multiobjective genetic algorithm. IEEE Trans Inf Technol Biomed 2009; 13:969-75. [PMID: 19304489 DOI: 10.1109/titb.2009.2017527] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Microarray technology enables the simultaneous monitoring of the expression pattern of a huge number of genes across different experimental conditions. Biclustering in microarray data is an important technique that discovers a group of genes that are coregulated in a subset of conditions. Biclustering algorithms are required to identify coherent and nontrivial biclusters, i.e., the biclusters should have low mean squared residue and high row variance. A multiobjective genetic biclustering technique is proposed here that optimizes these objectives simultaneously. A novel encoding scheme that uses variable chromosome length is developed. Moreover, a new quantitative measure to evaluate the goodness of the biclusters is proposed. The performance of the proposed algorithm has been evaluated on both simulated and real-life gene expression datasets, and compared with some other well-known biclustering techniques.
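The two objectives named in the abstract, mean squared residue and row variance, can be illustrated with a short sketch. It assumes the commonly used definitions (residue of an entry = value minus its row mean minus its column mean plus the overall mean of the submatrix); the abstract does not spell out the formulas, and the function name and toy matrix are hypothetical.

```python
def bicluster_scores(matrix, rows, cols):
    """Return (mean squared residue, row variance) of the submatrix
    selected by the given row and column indices.
    Definitions are the commonly used ones and are assumed here."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    # Residue measures how far each entry is from a purely additive pattern.
    msr = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_r) for j in range(n_c)
    ) / (n_r * n_c)
    # Row variance measures how much each gene fluctuates across conditions.
    row_var = sum(
        (sub[i][j] - row_means[i]) ** 2
        for i in range(n_r) for j in range(n_c)
    ) / (n_r * n_c)
    return msr, row_var

# Hypothetical toy matrix: rows 0-2 follow the same additive pattern, row 3 does not.
expr = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
    [9.0, 1.0, 5.0],
]
msr_coherent, var_coherent = bicluster_scores(expr, [0, 1, 2], [0, 1, 2])
msr_all, var_all = bicluster_scores(expr, [0, 1, 2, 3], [0, 1, 2])
```

The coherent bicluster (rows 0 to 2) has zero residue with non-zero row variance, whereas adding the incoherent row raises the residue, which is why the two quantities are treated as competing objectives.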
Affiliation(s)
- Ujjwal Maulik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India.
|
30
|
Katagiri F, Glazebrook J. Pattern discovery in expression profiling data. Curr Protoc Mol Biol 2009; Chapter 22:Unit 22.5. [PMID: 19170028 DOI: 10.1002/0471142727.mb2205s85] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
In expression profiling studies, it is often necessary to identify groups of genes with similar expression profiles in a variety of samples, and/or groups of samples with similar expression profiles. Each profile can be expressed as a single data point in a space with the same number of dimensions as there are parameters in the profiles. In this way, pattern discovery among expression profiles is translated into pattern discovery in the spatial distribution of data points: the similarity between profiles is defined by the distance between the corresponding data points. Various multivariate analysis methods, such as clustering and dimensionality reduction methods, are used to summarize the data point distribution to help the investigator recognize major trends. As different methods may identify different features of the distribution, it is important to analyze a particular data set with multiple methods.
|
31
|
Romdhane LB, Shili H, Ayeb B. Mining microarray gene expression data with unsupervised possibilistic clustering and proximity graphs. Appl Intell 2009. [DOI: 10.1007/s10489-009-0161-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
32
|
Madi A, Friedman Y, Roth D, Regev T, Bransburg-Zabary S, Jacob EB. Genome holography: deciphering function-form motifs from gene expression data. PLoS One 2008; 3:e2708. [PMID: 18628959 PMCID: PMC2444029 DOI: 10.1371/journal.pone.0002708] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 03/06/2008] [Accepted: 06/19/2008] [Indexed: 12/28/2022] Open
Abstract
Background DNA chips allow simultaneous measurements of genome-wide response of thousands of genes, i.e. system level monitoring of the gene-network activity. Advanced analysis methods have been developed to extract meaningful information from the vast amount of raw gene-expression data obtained from the microarray measurements. These methods usually aimed to distinguish between groups of subjects (e.g., cancer patients vs. healthy subjects) or to identify marker genes that help to distinguish between those groups. We assumed that motifs related to the internal structure of operons and gene-networks regulation are also embedded in microarray data and can be deciphered by using proper analysis. Methodology/Principal Findings The analysis presented here is based on investigating the gene-gene correlations. We analyze a database of gene expression of Bacillus subtilis exposed to sub-lethal levels of 37 different antibiotics. Using unsupervised analysis (dendrogram) of the matrix of normalized gene-gene correlations, we identified the operons as they form distinct clusters of genes in the sorted correlation matrix. Applying a dimension-reduction algorithm (Principal Component Analysis, PCA) to the matrices of normalized correlations reveals functional motifs. The genes are placed in a reduced 3-dimensional space of the three leading PCA eigen-vectors according to their corresponding eigen-values. We found that the organization of the genes in the reduced PCA space recovers motifs of the operon internal structure, such as the order of the genes along the genome, gene separation by non-coding segments, and translational start and end regions. In addition to the intra-operon structure, it is also possible to predict inter-operon relationships, operons sharing functional regulation factors, and more. In particular, we demonstrate the above in the context of the competence and sporulation pathways.
Conclusions/Significance We demonstrated that by analyzing gene-gene correlation from gene-expression data it is possible to identify operons and to predict unknown internal structure of operons and gene-networks regulation.
Affiliation(s)
- Asaf Madi
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Yonatan Friedman
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Computational and Systems Biology, Massachusetts Institute of Technology (MIT), Boston, Massachusetts, United States of America
- Dalit Roth
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Tamar Regev
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Sharron Bransburg-Zabary
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Eshel Ben Jacob
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- The Center for Theoretical and Biological Physics, University of California San Diego, La Jolla, California, United States of America
|
33
|
Katagiri F, Glazebrook J. Pattern discovery in expression profiling data. Curr Protoc Mol Biol 2008; Chapter 22:Unit 22.5. [PMID: 18265360 DOI: 10.1002/0471142727.mb2205s69] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/06/2022]
Abstract
In expression profiling studies, it is often necessary to identify groups of genes with similar expression profiles in a variety of samples, and/or groups of samples with similar expression profiles. Each profile can be expressed as a single data point in a space with the same number of dimensions as there are parameters in the profiles. In this way, pattern discovery among expression profiles is translated into pattern discovery in the spatial distribution of data points. Hierarchical clustering is useful for clustering similarly behaving genes or samples at local levels and for displaying the results in a simple color-coded manner. K-means clustering can be used for discovery of well-defined clusters. Principal component analysis and self-organizing maps can be used for dimensionality reduction, thereby facilitating visualization of major trends in data sets.
|
34
|
Ray SS, Bandyopadhyay S, Pal SK. Dynamic range-based distance measure for microarray expressions and a fast gene-ordering algorithm. IEEE Trans Syst Man Cybern B Cybern 2007; 37:742-9. [PMID: 17550128 DOI: 10.1109/tsmcb.2006.889812] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
Abstract
This investigation deals with a new distance measure for genes using their microarray expressions and a new algorithm for fast gene ordering without clustering. This distance measure is called "Maxrange distance," where the distance between two genes corresponding to a particular type of experiment is computed using a normalization factor, which is dependent on the dynamic range of the gene expression values of that experiment. The new gene-ordering method called "Minimal Neighbor" is based on the concept of nearest neighbor heuristic involving O(n^2) time complexity. The superiority of this distance measure and the comparability of the ordering algorithm have been extensively established on widely studied microarray data sets by performing statistical tests. An interesting application of this ordering algorithm is also demonstrated for finding useful groups of genes within clusters obtained from a nonhierarchical clustering method like the self-organizing map.
|
35
|
Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics 2007; 8:111. [PMID: 17397530 PMCID: PMC1858704 DOI: 10.1186/1471-2105-8-111] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.6] [Received: 08/27/2006] [Accepted: 03/30/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the use of the well known Euclidean distance and Pearson correlation coefficient. RESULTS Relying on several public gene expression datasets, we evaluate the homogeneity and separation scores of different clustering solutions. It was found that the use of the MI measure yields a more significant differentiation among erroneous clustering solutions. The proposed measure was also used to analyze the performance of several known clustering algorithms. A comparative study of these algorithms reveals that their "best solutions" are ranked almost oppositely when using different distance measures, despite the found correspondence between these measures when analysing the averaged scores of groups of solutions. CONCLUSION In view of the results, further attention should be paid to the selection of a proper distance measure for analyzing the clustering of gene expression data.
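To make the contrast between the three measures concrete, the following sketch computes Euclidean distance, Pearson correlation, and a histogram-based mutual information estimate for a pair of profiles. The equal-width binning, the number of bins, the function names, and the toy profiles are assumptions for illustration; the abstract does not state how the study estimated mutual information.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def discretize(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """Plug-in mutual information estimate (in bits) from binned profiles.
    The binning scheme is an assumption made for this sketch."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

# Hypothetical profiles with a symmetric, nonlinear (quadratic) relationship.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [(v - 2.5) ** 2 for v in x]

r = pearson(x, y)
mi = mutual_information(x, y, bins=3)
d = math.dist(x, y)  # Euclidean distance (Python 3.8+)
```

For this toy pair the Pearson correlation is essentially zero while the mutual information estimate is clearly positive, showing how the choice of measure can change which profiles are considered related.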
|
36
|
Chen GQ, Turner C, He X, Nguyen T, McKeon TA, Laudencia-Chingcuanco D. Expression profiles of genes involved in fatty acid and triacylglycerol synthesis in castor bean (Ricinus communis L.). Lipids 2007; 42:263-74. [PMID: 17393231 DOI: 10.1007/s11745-007-3022-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Received: 10/05/2006] [Accepted: 12/24/2006] [Indexed: 10/23/2022]
Abstract
Castor seed triacylglycerols (TAGs) contain 90% ricinoleate (12-hydroxy-oleate) which has numerous industrial applications. Due to the presence of the toxin ricin and potent allergenic 2S albumins in the seed, it is desirable to produce ricinoleate from temperate oilseeds. To identify regulatory genes or genes for enzymes that may up-regulate multiple activities or entire pathways leading to the ricinoleate and TAG synthesis, we have analyzed expression profiles of 12 castor genes involved in fatty acid and TAG synthesis using quantitative reverse transcription-polymerase chain reaction technology. A collection of castor seeds with well-defined developmental stages and morphologies was used to determine the levels of mRNA, ricinoleate and TAG. The synthesis of ricinoleate and TAG occurred when seeds progressed to stages of cellular endosperm development. Concomitantly, most of the genes increased their expression levels, but showed various temporal expression patterns and different maximum inductions ranging from 4- to 43,000-fold. Clustering analysis of the expression data indicated five gene groups with distinct temporal patterns. We identified genes involved in fatty acid biosynthesis and transport that fell into two related clusters with moderate flat-rise or concave-rise patterns, and others that were highly expressed during seed development that displayed either linear-rise or bell-shaped patterns. Castor diacylglycerol acyltransferase 1 was the only gene having a higher expression level in leaf and a declining pattern during cellular endosperm development. The relationships among gene expression, cellular endosperm development and ricinoleate/TAG accumulation are discussed.
Affiliation(s)
- Grace Q Chen
- Western Regional Research Center, Agricultural Research Service, U.S. Department of Agriculture, 800 Buchanan St., Albany, CA 94710, USA.
|
37
|
Aigner T, Fundel K, Saas J, Gebhard PM, Haag J, Weiss T, Zien A, Obermayr F, Zimmer R, Bartnik E. Large-scale gene expression profiling reveals major pathogenetic pathways of cartilage degeneration in osteoarthritis. Arthritis Rheum 2006; 54:3533-44. [PMID: 17075858 DOI: 10.1002/art.22174] [Citation(s) in RCA: 269] [Impact Index Per Article: 14.9] [Indexed: 11/10/2022]
Abstract
OBJECTIVE Despite many research efforts in recent decades, the major pathogenetic mechanisms of osteoarthritis (OA), including gene alterations occurring during OA cartilage degeneration, are poorly understood, and there is no disease-modifying treatment approach. The present study was therefore initiated in order to identify differentially expressed disease-related genes and potential therapeutic targets. METHODS This investigation consisted of a large gene expression profiling study performed based on 78 normal and disease samples, using a custom-made complementary DNA array covering >4,000 genes. RESULTS Many differentially expressed genes were identified, including the expected up-regulation of anabolic and catabolic matrix genes. In particular, the down-regulation of important oxidative defense genes, i.e., the genes for superoxide dismutases 2 and 3 and glutathione peroxidase 3, was prominent. This indicates that continuous oxidative stress to the cells and the matrix is one major underlying pathogenetic mechanism in OA. Also, genes that are involved in the phenotypic stability of cells, a feature that is greatly reduced in OA cartilage, appeared to be suppressed. CONCLUSION Our findings provide a reference data set on gene alterations in OA cartilage and, importantly, indicate major mechanisms underlying central cell biologic alterations that occur during the OA disease process. These results identify molecular targets that can be further investigated in the search for therapeutic interventions.
Affiliation(s)
- Thomas Aigner
- Osteoarticular and Arthritis Research, Institute of Pathology, University of Leipzig, Liebigstrasse 26, D-04103 Leipzig, Germany.
|
38
|
Hess AP, Hamilton AE, Talbi S, Dosiou C, Nyegaard M, Nayak N, Genbecev-Krtolica O, Mavrogianis P, Ferrer K, Kruessel J, Fazleabas AT, Fisher SJ, Giudice LC. Decidual stromal cell response to paracrine signals from the trophoblast: amplification of immune and angiogenic modulators. Biol Reprod 2006; 76:102-17. [PMID: 17021345 DOI: 10.1095/biolreprod.106.054791] [Citation(s) in RCA: 216] [Impact Index Per Article: 12.0] [Indexed: 11/01/2022] Open
Abstract
During the invasive phase of implantation, trophoblasts and maternal decidual stromal cells secrete products that regulate trophoblast differentiation and migration into the maternal endometrium. Paracrine interactions between the extravillous trophoblast and the maternal decidua are important for successful embryonic implantation, including establishing the placental vasculature, anchoring the placenta to the uterine wall, and promoting the immunoacceptance of the fetal allograft. To our knowledge, global crosstalk between the trophoblast and the decidua has not been elucidated to date, and the present study used a functional genomics approach to investigate these paracrine interactions. Human endometrial stromal cells were decidualized with progesterone and further treated with conditioned media from human trophoblasts (TCM) or, as a control, with control conditioned media (CCM) from nondecidualized stromal cells for 0, 3, and 12 h. Total RNA was isolated and processed for analysis on whole-genome, high-density oligonucleotide arrays containing 54,600 genes. We found that 1374 genes were significantly upregulated and that 3443 genes were significantly downregulated after 12 h of coincubation of stromal cells with TCM, compared to CCM. Among the most upregulated genes were the chemokines CXCL1 (GRO1) and IL8, CXCR4, and other genes involved in the immune response (CCL8 [SCYA8], pentraxin 3 (PTX3), IL6, and interferon-regulated and -related genes) as well as TNFAIP6 (tumor necrosis factor alpha-induced protein 6) and metalloproteinases (MMP1, MMP10, and MMP14). Among the downregulated genes were growth factors, e.g., IGF1, FGF1, TGFB1, and angiopoietin-1, and genes involved in Wnt signaling (WNT4 and FZD). Real-time RT-PCR and ELISAs, as well as immunohistochemical analysis of human placental bed specimens, confirmed these data for representative genes of both up- and downregulated groups.
The data demonstrate a significant induction of proinflammatory cytokines and chemokines, as well as angiogenic/static factors in decidualized endometrial stromal cells in response to trophoblast-secreted products. The data suggest that the trophoblast acts to alter the local immune environment of the decidua to facilitate the process of implantation and ensure an enriched cytokine/chemokine environment while limiting the mitotic activity of the stromal cells during the invasive phase of implantation.
Affiliation(s)
- A P Hess
- Department of Obstetrics, Gynecology, and Reproductive Sciences, University of California, San Francisco, California 94143-0132, USA
|
39
|
Abstract
Understanding individual response to a drug (what determines its efficacy and tolerability) is the major bottleneck in current drug development and clinical trials. Intracellular response and metabolism, for example through cytochrome P-450 enzymes, may either enhance or decrease the effect of different drugs, dependent on the genetic variant. Microarrays offer the potential to screen the genetic composition of the individual patient. However, experiments are "noisy" and must be accompanied by solid and robust data analysis. Furthermore, recent research aims at the combination of high-throughput data with methods of mathematical modeling, enabling problem-oriented assistance in the drug discovery process. This article will discuss state-of-the-art DNA array technology platforms and the basic elements of data analysis and bioinformatics research in drug discovery. Enhancing single-gene analysis, we will present a new method for interpreting gene expression changes in the context of entire pathways. Furthermore, we will introduce the concept of systems biology as a new paradigm for drug development and highlight our recent research: the development of a modeling and simulation platform for biomedical applications. We discuss the potential of systems biology for modeling the drug response of the individual patient.
Collapse
Affiliation(s)
- Ralf Herwig
- Max Planck Institute for Molecular Genetics, Department of Vertebrate Genomics, Berlin, Germany.
|
40
|
Jiang D, Pei J, Ramanathan M, Lin C, Tang C, Zhang A. Mining gene–sample–time microarray data: a coherent gene cluster discovery approach. Knowl Inf Syst 2006. [DOI: 10.1007/s10115-006-0031-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
|
41
|
Abstract
The study of gene expression profiling of cells and tissue has become a major tool for discovery in medicine. Microarray experiments allow description of genome-wide expression changes in health and disease. The results of such experiments are expected to change the methods employed in the diagnosis and prognosis of disease in obstetrics and gynecology. Moreover, an unbiased and systematic study of gene expression profiling should allow the establishment of a new taxonomy of disease for obstetric and gynecologic syndromes. Thus, a new era is emerging in which reproductive processes and disorders could be characterized using molecular tools and fingerprinting. The design, analysis, and interpretation of microarray experiments require specialized knowledge that is not part of the standard curriculum of our discipline. This article describes the types of studies that can be conducted with microarray experiments (class comparison, class prediction, class discovery). We discuss key issues pertaining to experimental design, data preprocessing, and gene selection methods. Common types of data representation are illustrated. Potential pitfalls in the interpretation of microarray experiments, as well as the strengths and limitations of this technology, are highlighted. This article is intended to assist clinicians in appraising the quality of the scientific evidence now reported in the obstetric and gynecologic literature.
Affiliation(s)
- Adi L. Tarca
- Perinatology Research Branch, National Institute of Child Health and Human Development, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, and Detroit, MI
- Department of Computer Science, Wayne State University
- Roberto Romero
- Perinatology Research Branch, National Institute of Child Health and Human Development, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, and Detroit, MI
- Center for Molecular Medicine and Genetics, Wayne State University
- Sorin Draghici
- Department of Computer Science, Wayne State University
- Karmanos Cancer Institute, Detroit, MI
|
42
|
Fang Z, Yang J, Li Y, Luo Q, Liu L. Knowledge guided analysis of microarray data. J Biomed Inform 2006; 39:401-11. [PMID: 16214421 DOI: 10.1016/j.jbi.2005.08.004] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 04/27/2005] [Revised: 08/09/2005] [Accepted: 08/15/2005] [Indexed: 10/25/2022]
Abstract
In microarray expression data analysis, it is well accepted that biological knowledge-guided clustering techniques have advantages over purely mathematical techniques. In this paper, Gene Ontology is introduced to guide the clustering process, and thus a new algorithm capturing both expression pattern similarities and biological function similarities is developed. Our algorithm was validated on two well-known public data sets and the results were compared with previous work. It is shown that our method has advantages in both the quality of clusters and the precision of biological annotations. Furthermore, the clustering results can be adjusted according to different stringency requirements. It is expected that our algorithm can be extended to other biological knowledge, for example, metabolic networks.
Affiliation(s)
- Zhuo Fang
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei 430074, PR China
|
43
|
Fang Z, Liu L, Yang J, Luo QM, Li YX. Comparisons of graph-structure clustering methods for gene expression data. Acta Biochim Biophys Sin (Shanghai) 2006; 38:379-84. [PMID: 16761095 DOI: 10.1111/j.1745-7270.2006.00175.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022] Open
Abstract
Although many numerical clustering algorithms have been applied to gene expression data analysis, the essential step is still biological interpretation by manual inspection. The correlation between genetic co-regulation and affiliation to a common biological process is what biologists expect. Here, we introduce some clustering algorithms that are based on graph structure constituted by biological knowledge. After applying them to a widely used dataset, we compared the resulting clusters of two of these algorithms in terms of cluster homogeneity, coherence of annotation, and matching ratio. The results show that the clusters of knowledge-guided analysis are the kernel parts of the clusters of Gene Ontology (GO)-Cluster software, containing the genes whose expression is most strongly correlated and most consistent with biological functions. Moreover, knowledge-guided analysis seems much more applicable than GO-Cluster in a larger dataset.
Affiliation(s)
- Zhuo Fang
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
|
44
|
Talbi S, Hamilton AE, Vo KC, Tulac S, Overgaard MT, Dosiou C, Le Shay N, Nezhat CN, Kempson R, Lessey BA, Nayak NR, Giudice LC. Molecular phenotyping of human endometrium distinguishes menstrual cycle phases and underlying biological processes in normo-ovulatory women. Endocrinology 2006; 147:1097-121. [PMID: 16306079 DOI: 10.1210/en.2005-1076] [Citation(s) in RCA: 417] [Impact Index Per Article: 23.2] [Indexed: 11/19/2022]
Abstract
Histological evaluation of endometrium has been the gold standard for clinical diagnosis and management of women with endometrial disorders. However, several recent studies have questioned the accuracy and utility of such evaluation, mainly because of significant intra- and interobserver variations in histological interpretation. To examine the possibility that biochemical or molecular signatures of endometrium may prove to be more useful, we have investigated whole-genome molecular phenotyping (54,600 genes and expressed sequence tags) of this tissue sampled across the cycle in 28 normo-ovulatory women, using high-density oligonucleotide microarrays. Unsupervised principal component analysis of all samples revealed that samples self-cluster into four groups consistent with histological phenotypes of proliferative (PE), early-secretory (ESE), mid-secretory (MSE), and late-secretory (LSE) endometrium. Independent hierarchical clustering analysis revealed equivalent results, with two major dendrogram branches corresponding to PE/ESE and MSE/LSE and sub-branching into the four respective phases with heterogeneity among samples within each sub-branch. K-means clustering of genes revealed four major patterns of gene expression (high in PE, high in ESE, high in MSE, and high in LSE), and gene ontology analysis of these clusters demonstrated cycle-phase-specific biological processes and molecular functions. Six samples with ambiguous histology were identically assignable to a cycle phase by both principal component analysis and hierarchical clustering. Additionally, pairwise comparisons of relative gene expression across the cycle revealed genes/families that clearly distinguish the transitions of PE-->ESE, ESE-->MSE, and MSE-->LSE, including receptomes and signaling pathways. Select genes were validated by quantitative RT-PCR. 
Overall, the results demonstrate that endometrial samples obtained by two different sampling techniques (biopsy and curetting hysterectomy specimens) from subjects who are as normal as possible in a human study and including those with unknown histology, can be classified by their molecular signatures and correspond to known phases of the menstrual cycle with identical results using two independent analytical methods. Also, the results enable global identification of biological processes and molecular mechanisms that occur dynamically in the endometrium in the changing steroid hormone milieu across the menstrual cycle in normo-ovulatory women. The results underscore the potential of gene expression profiling for developing molecular diagnostics of endometrial normalcy and abnormalities and identifying molecular targets for therapeutic purposes in endometrial disorders.
Affiliation(s)
- S Talbi
- Department of Obstetrics, Gynecology, and Reproductive Sciences, University of California, San Francisco, Parnassus, M1495, Box 0132, San Francisco, California 94143-0132, USA
|
45
|
Yang C, Zeng E, Li T, Narasimhan G. Clustering genes using gene expression and text literature data. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2006:329-40. [PMID: 16447990 DOI: 10.1109/csb.2005.23] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Clustering of gene expression data is a standard technique used to identify closely related genes. In this paper, we develop a new clustering algorithm, MSC (Multi-Source Clustering), to perform exploratory analysis using two or more diverse sources of data. In particular, we investigate the problem of improving the clustering by integrating information obtained from gene expression data with knowledge extracted from biomedical text literature. In each iteration of algorithm MSC, an EM-type procedure is employed to bootstrap the model obtained from one data source by starting with the cluster assignments obtained in the previous iteration using the other data sources. Upon convergence, the two individual models are used to construct the final cluster assignment. We compare the results of algorithm MSC for two data sources with the results obtained when the clustering is applied on the two sources of data separately. We also compare them with the results obtained using the feature-level integration method, which performs the clustering after simply concatenating the features obtained from the two data sources. We show that the z-scores of the clustering results from MSC are better than those from the other methods. To evaluate our clusters better, function enrichment results are presented using terms from the Gene Ontology database. Finally, by investigating the success of motif detection programs that use the clusters, we show that our approach integrating gene expression data and text data reveals clusters that are biologically more meaningful than those identified using gene expression data alone.
Affiliation(s)
- Chengyong Yang
- Bioinformatics Research Group, School of Computer Science, Florida International University, Miami, FL 33199, USA.
|
46
|
Herwig R, Lehrach H. Expression profiling of drug response--from genes to pathways. DIALOGUES IN CLINICAL NEUROSCIENCE 2006; 8:283-93. [PMID: 17117610 PMCID: PMC3181826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/12/2023]
Abstract
Understanding individual response to a drug (what determines its efficacy and tolerability) is the major bottleneck in current drug development and clinical trials. Intracellular response and metabolism, for example through cytochrome P-450 enzymes, may either enhance or decrease the effect of different drugs, dependent on the genetic variant. Microarrays offer the potential to screen the genetic composition of the individual patient. However, experiments are "noisy" and must be accompanied by solid and robust data analysis. Furthermore, recent research aims at the combination of high-throughput data with methods of mathematical modeling, enabling problem-oriented assistance in the drug discovery process. This article will discuss state-of-the-art DNA array technology platforms and the basic elements of data analysis and bioinformatics research in drug discovery. Enhancing single-gene analysis, we will present a new method for interpreting gene expression changes in the context of entire pathways. Furthermore, we will introduce the concept of systems biology as a new paradigm for drug development and highlight our recent research: the development of a modeling and simulation platform for biomedical applications. We discuss the potentials of systems biology for modeling the drug response of the individual patient.
Affiliation(s)
- Ralf Herwig
- Max Planck Institute for Molecular Genetics, Department of Vertebrate Genomics, Berlin, Germany.
|
47
|
Wolski WE, Lalowski M, Martus P, Herwig R, Giavalisco P, Gobom J, Sickmann A, Lehrach H, Reinert K. Transformation and other factors of the peptide mass spectrometry pairwise peak-list comparison process. BMC Bioinformatics 2005; 6:285. [PMID: 16318636 PMCID: PMC1343595 DOI: 10.1186/1471-2105-6-285] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 08/08/2005] [Accepted: 11/30/2005] [Indexed: 11/22/2022] Open
Abstract
Background: Biological Mass Spectrometry is used to analyse peptides and proteins. A mass spectrum generates a list of measured mass to charge ratios and intensities of ionised peptides, which is called a peak-list. In order to classify the underlying amino acid sequence, the acquired spectra are usually compared with synthetic ones. Development of suitable methods of direct peak-list comparison may be advantageous for many applications. Results: The pairwise peak-list comparison is a multistage process composed of matching of peaks embedded in two peak-lists, normalisation, scaling of peak intensities and dissimilarity measures. In our analysis, we focused on binary and intensity based measures. We have modified the measures in order to account for the mass-spectrometry-specific properties of mass measurement accuracy and non-matching peaks. We compared the labelling of peak-list pairs, obtained using different factors of the pairwise peak-list comparison, as being the same or different to those determined by sequence database searches. In order to elucidate how these factors influence the peak-list comparison we adopted an analysis of variance type method with the partial area under the ROC curve as a dependent variable. Conclusion: The analysis of variance provides insight into the relevance of various factors influencing the outcome of the pairwise peak-list comparison. For large MS/MS and PMF data sets the outcome of ANOVA analysis was consistent, providing a strong indication that the results presented here might be valid for many types of peptide mass measurements.
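The matching and binary-measure stages named in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the greedy in-order matching rule, the absolute mass tolerance, and the Jaccard-style dissimilarity are all hypothetical choices, not the modified measures defined in the paper.

```python
def count_matched_peaks(mz_a, mz_b, tol):
    """Count peaks whose m/z values agree within an absolute tolerance.

    Assumption: a simple greedy two-pointer match over sorted m/z lists,
    where each peak can be matched at most once.
    """
    a, b = sorted(mz_a), sorted(mz_b)
    i = j = matched = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tol:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # a[i] is too small to match any remaining peak in b
        else:
            j += 1  # b[j] is too small to match any remaining peak in a
    return matched


def binary_dissimilarity(mz_a, mz_b, tol):
    """Jaccard-style dissimilarity: 1 - matched / (all distinct peaks)."""
    matched = count_matched_peaks(mz_a, mz_b, tol)
    union = len(mz_a) + len(mz_b) - matched
    return 1.0 - matched / union if union else 0.0
```

Non-matching peaks enter only through the denominator here; an intensity-based measure would additionally need normalised and scaled peak intensities, which this sketch leaves out.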
Affiliation(s)
- Witold E Wolski
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany
- School of Mathematics and Statistics, Merz Court, University of Newcastle upon Tyne, NE1 7RU, UK
- Maciej Lalowski
- Max Delbrück Center for Molecular Medicine, Robert-Roessle-Str. 10, D-13125 Berlin-Buch, Germany
- Peter Martus
- Institute for Medical Informatics, Biometry and Epidemiology; Charite University Medicine Berlin, Hindenburgdamm 30 (HBD 30), 12200 Berlin
- Ralf Herwig
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany
- Patrick Giavalisco
- Boyce Thompson Institute for Plant Research, Tower Road, Ithaca 14850, NY, USA
- Johan Gobom
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany
- Albert Sickmann
- DFG Research Center for Experimental Biomedicine, University of Würzburg, Versbacherstr. 9, D-97078 Würzburg, Germany
- Hans Lehrach
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany
- Knut Reinert
- Institute for Computer Science, Free University Berlin, Takustr. 9, D-14195 Berlin, Germany
|
48
|
Gupta A, Maranas CD, Albert R. Elucidation of directionality for co-expressed genes: predicting intra-operon termination sites. Bioinformatics 2005; 22:209-14. [PMID: 16287937 DOI: 10.1093/bioinformatics/bti780] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION In this paper, we present a novel framework for inferring regulatory and sequence-level information from gene co-expression networks. The key idea of our methodology is the systematic integration of network inference and network topological analysis approaches for uncovering biological insights. RESULTS We determine the gene co-expression network of Bacillus subtilis using Affymetrix GeneChip time-series data and show how the inferred network topology can be linked to sequence-level information hard-wired in the organism's genome. We propose a systematic way for determining the correlation threshold at which two genes are assessed to be co-expressed using the clustering coefficient and we expand the scope of the gene co-expression network by proposing the slope ratio metric as a means for incorporating directionality on the edges. We show through specific examples for B. subtilis that by incorporating expression level information in addition to the temporal expression patterns, we can uncover sequence-level biological insights. In particular, we are able to identify a number of cases where (1) the co-expressed genes are part of a single transcriptional unit or operon and (2) the inferred directionality arises due to the presence of intra-operon transcription termination sites. AVAILABILITY The software will be provided on request. SUPPLEMENTARY INFORMATION http://www.phys.psu.edu/~ralbert/pdf/gma_bioinf_supp.pdf
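To make the idea of relating a correlation threshold to the clustering coefficient more concrete, here is a minimal Python sketch. It builds a co-expression network by thresholding absolute Pearson correlations and reports the average clustering coefficient at each candidate threshold. The function names are hypothetical, and the abstract does not state the authors' actual selection rule, so the sketch only produces the threshold-versus-coefficient curve and leaves the choice of cut-off open. It also assumes no expression profile is constant, since a constant row has an undefined correlation.

```python
import numpy as np


def coexpression_adjacency(expr, threshold):
    """Link two genes when |Pearson r| between their profiles >= threshold.

    expr: array-like of shape (n_genes, n_timepoints).
    """
    corr = np.corrcoef(np.asarray(expr, dtype=float))
    adj = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected 0/1 adjacency matrix."""
    degree = adj.sum(axis=1)
    closed_walks = np.diagonal(adj @ adj @ adj)  # twice the triangles at each node
    possible = degree * (degree - 1)             # twice the neighbour pairs
    local = np.divide(closed_walks, possible,
                      out=np.zeros(len(degree)), where=possible > 0)
    return float(local.mean())


def scan_thresholds(expr, thresholds):
    """Return (threshold, average clustering coefficient) pairs."""
    return [(t, average_clustering(coexpression_adjacency(expr, t)))
            for t in thresholds]
```

Nodes with fewer than two neighbours are given a local coefficient of zero, which is one common convention; a different convention would shift the curve at high thresholds where many genes become isolated.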
Affiliation(s)
- Anshuman Gupta
- Academic Services and Emerging Technologies, The Pennsylvania State University, University Park, PA, USA
|
49
|
Yoo C, Cooper GF, Schmidt M. A control study to evaluate a computer-based microarray experiment design recommendation system for gene-regulation pathways discovery. J Biomed Inform 2005; 39:126-46. [PMID: 16203178 DOI: 10.1016/j.jbi.2005.05.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 01/12/2005] [Revised: 04/22/2005] [Accepted: 05/27/2005] [Indexed: 11/22/2022]
Abstract
The main topic of this paper is evaluating a system that uses the expected value of experimentation for discovering causal pathways in gene expression data. By experimentation we mean both interventions (e.g., a gene knock-out experiment) and observations (e.g., passively observing the expression level of a "wild-type" gene). We introduce a system called GEEVE (causal discovery in Gene Expression data using Expected Value of Experimentation), which implements expected value of experimentation in discovering causal pathways using gene expression data. GEEVE provides the following assistance, which is intended to help biologists in their quest to discover gene-regulation pathways: (1) recommending which experiments to perform (with a focus on "knock-out" experiments) using an expected value of experimentation (EVE) method; (2) recommending the number of measurements (observational and experimental) to include in the experimental design, again using an EVE method; and (3) providing a Bayesian analysis that combines prior knowledge with the results of recent microarray experiments to derive posterior probabilities of gene regulation relationships. In recommending which experiments to perform (and how many times to repeat them), the EVE approach considers the biologist's preferences for which genes to focus the discovery process on. Also, since exact EVE calculations are exponential in time, GEEVE incorporates approximation methods. GEEVE is able to combine data from knock-out experiments with data from wild-type experiments to suggest additional experiments to perform and then to analyze the results of those experiments. It models the possibility that unmeasured (latent) variables may be responsible for some of the statistical associations among the expression levels of the genes under study. To evaluate the GEEVE system, we used a gene expression simulator to generate data from specified models of gene regulation.
Using the simulator, we evaluated the GEEVE system using a randomized control study that involved 10 biologists, some of whom used GEEVE and some of whom did not. The results show that biologists who used GEEVE reached correct causal assessments about gene regulation more often than did those biologists who did not use GEEVE. The GEEVE users also reached their assessments in a more cost-effective manner.
Affiliation(s)
- Changwon Yoo
- Department of Computer Science, University of Montana, 420 Social Sciences, University of Montana, Missoula, MT 59803, USA.
|
50
|
Abstract
The advent of microarray technology has revolutionized the search for genes that are differentially expressed across a range of cell types or experimental conditions. Traditional clustering methods, such as hierarchical clustering, are often difficult to deploy effectively since genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering of gene expression data (also called co-clustering or two-way clustering) is a non-trivial but promising methodology for the identification of gene groups that show a coherent expression profile across a subset of conditions. Thus, biclustering is a natural methodology as a screen for genes that are functionally related, participate in the same pathways, are affected by the same drug or pathological condition, or form modules that are potentially co-regulated by a small group of transcription factors. We have developed a web-enabled service called GEMS (Gene Expression Mining Server) for biclustering microarray data. Users may upload expression data and specify a set of criteria. GEMS then performs bicluster mining based on a Gibbs sampling paradigm. The web server provides a flexible and useful platform for the discovery of co-expressed and potentially co-regulated gene modules. GEMS is open source software and is available at .
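GEMS mines biclusters with Gibbs sampling, which is not reproduced here. To make the notion of "a coherent expression profile across a subset of conditions" concrete, the Python sketch below scores a candidate bicluster with the mean squared residue of Cheng and Church, a different and simpler coherence criterion swapped in purely for illustration. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix data[rows, cols].

    A score of 0 means every selected gene follows the same additive
    pattern across the selected conditions; larger values mean the
    candidate bicluster is less coherent.
    """
    sub = np.asarray(data, dtype=float)[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    residue = sub - row_means - col_means + sub.mean()
    return float((residue ** 2).mean())


# Toy expression matrix (genes x conditions): the first three genes are
# shifted copies of one another over the first three conditions only.
toy = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [5, 6, 7, 5],
    [0, 9, 1, 3],
]
coherent_score = mean_squared_residue(toy, [0, 1, 2], [0, 1, 2])
full_score = mean_squared_residue(toy, [0, 1, 2, 3], [0, 1, 2, 3])
```

In this toy case the three-gene, three-condition submatrix scores essentially zero while the full matrix does not, which is the situation the abstract describes: genes that agree on a subset of conditions but not across all of them.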
Affiliation(s)
- Chang-Jiun Wu
- Program in Bioinformatics, Boston University, Boston, MA 02215, USA
- Simon Kasif
- Program in Bioinformatics, Boston University, Boston, MA 02215, USA
- Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- To whom correspondence should be addressed. Tel: +1 617 358 1845; Fax: +1 617 353 6766
|