1
Jaroszewicz A, Ernst J. ChromGene: gene-based modeling of epigenomic data. Genome Biol 2023; 24:203. [PMID: 37679846 PMCID: PMC10486095 DOI: 10.1186/s13059-023-03041-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/24/2022] [Accepted: 08/21/2023] [Indexed: 09/09/2023]
Abstract
Various computational approaches have been developed to annotate epigenomes on a per-position basis by modeling combinatorial and spatial patterns within epigenomic data. However, such annotations are less suitable for gene-based analyses. We present ChromGene, a method based on a mixture of learned hidden Markov models, to annotate genes based on multiple epigenomic maps across the gene body and flanks. We provide ChromGene assignments for over 100 cell and tissue types. We characterize the mixture components in terms of gene expression, constraint, and other gene annotations. The ChromGene method and annotations will provide a useful resource for gene-based epigenomic analyses.
Affiliation(s)
- Artur Jaroszewicz
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jason Ernst
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Computational Medicine Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
2
Nyamundanda G, Eason K, Guinney J, Lord CJ, Sadanandam A. A Machine-Learning Tool Concurrently Models Single Omics and Phenome Data for Functional Subtyping and Personalized Cancer Medicine. Cancers (Basel) 2020; 12:E2811. [PMID: 33007815 PMCID: PMC7601761 DOI: 10.3390/cancers12102811] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2020] [Revised: 09/22/2020] [Accepted: 09/25/2020] [Indexed: 11/29/2022]
Abstract
One of the major challenges in defining clinically relevant and less heterogeneous tumor subtypes is assigning biological and/or clinical interpretations to etiological (intrinsic) subtypes. Conventional clustering/subtyping approaches often fail to define such subtypes, as they involve several discrete steps. Here we demonstrate a unique machine-learning method, phenotype mapping (PhenMap), which jointly integrates single omics data with phenotypic information using three published breast cancer datasets (n = 2045). The PhenMap framework uses a modified factor analysis method that is governed by a key assumption that features from different omics data types are correlated due to specific "hidden/mapping" variables (context-specific mapping variables (CMV)). These variables can be simultaneously modeled with phenotypic data as covariates to yield functional subtypes and their associated features (e.g., genes) and phenotypes. In one example, we demonstrate the identification and validation of six novel "functional" (discrete) subtypes with differential responses to a cyclin-dependent kinase (CDK)4/6 inhibitor and etoposide by jointly integrating transcriptome profiles with four different drug response datasets from 37 breast cancer cell lines. These robust subtypes are also present in patient breast tumors with different prognoses. In another example, we modeled patient gene expression profiles and clinical covariates together to identify continuous subtypes with clinical/biological implications. Overall, this genome-phenome machine-learning integration tool, PhenMap, identifies functional and phenotype-integrated discrete or continuous subtypes with clinical translational potential.
Affiliation(s)
- Gift Nyamundanda
  - Division of Molecular Pathology, The Institute of Cancer Research, London SW3 6JB, UK; (G.N.); (K.E.)
- Katherine Eason
  - Division of Molecular Pathology, The Institute of Cancer Research, London SW3 6JB, UK; (G.N.); (K.E.)
- Christopher J. Lord
  - The Breast Cancer Now Toby Robins Research Centre, The Institute of Cancer Research, London SW3 6JB, UK
- Anguraj Sadanandam
  - Division of Molecular Pathology, The Institute of Cancer Research, London SW3 6JB, UK; (G.N.); (K.E.)
3
Di Serio C, Scala S, Vicard P. Bayesian networks for cell differentiation process assessment. Stat (Int Stat Inst) 2020. [DOI: 10.1002/sta4.287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Clelia Di Serio
  - University Centre for Statistics in the Biomedical Sciences, Vita‐Salute San Raffaele University, Milan 20132, Italy
- Serena Scala
  - San Raffaele Telethon Institute for Gene Therapy (TIGET), Milan 20132, Italy
- Paola Vicard
  - Department of Economics, University Roma Tre, Rome 00154, Italy
4
Rudd J, Zelaya RA, Demidenko E, Goode EL, Greene CS, Doherty JA. Leveraging global gene expression patterns to predict expression of unmeasured genes. BMC Genomics 2015; 16:1065. [PMID: 26666289 PMCID: PMC4678722 DOI: 10.1186/s12864-015-2250-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/18/2015] [Accepted: 11/27/2015] [Indexed: 12/31/2022]
Abstract
Background Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. Results We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|rP|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1–3), and |rP| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |rP|. 
Conclusions GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost-effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exists.
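The greedy selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' GGS implementation: the function name `greedy_gene_selection`, the capped-support scoring rule, and the toy correlation matrix are all assumptions made for illustration. Only the three parameters (DM set size, correlation threshold, redundancy) come from the abstract.

```python
def greedy_gene_selection(corr, dm_size, r_threshold=0.6, redundancy=1):
    """Greedily pick directly measured (DM) genes from a gene-gene correlation matrix.

    corr: square list of lists of Pearson correlations (hypothetical input format).
    Returns (dm_genes, predictable_genes) as lists of gene indices.
    """
    n = len(corr)
    # linked[i][j] is True when genes i and j pass the |r| threshold.
    linked = [[i != j and abs(corr[i][j]) >= r_threshold for j in range(n)]
              for i in range(n)]
    selected = []
    support = [0] * n  # number of selected DM genes linked to each gene

    for _ in range(min(dm_size, n)):
        best_gene, best_gain = None, -1
        for g in range(n):
            if g in selected:
                continue
            # Assumed surrogate objective: total support, capped at the
            # redundancy level, over genes that would remain unmeasured.
            gain = sum(
                min(support[j] + linked[g][j], redundancy)
                for j in range(n)
                if j != g and j not in selected
            )
            if gain > best_gain:
                best_gene, best_gain = g, gain
        selected.append(best_gene)
        support = [support[j] + linked[best_gene][j] for j in range(n)]

    # An unmeasured gene counts as predictable once enough DM genes are linked to it.
    predictable = [j for j in range(n)
                   if j not in selected and support[j] >= redundancy]
    return selected, predictable


# Toy correlation matrix: genes 0-2 form one correlated block, genes 3-4 another.
corr = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.0, 0.1],
    [0.1, 0.2, 0.0, 1.0, 0.75],
    [0.0, 0.1, 0.1, 0.75, 1.0],
]
dm, predictable = greedy_gene_selection(corr, dm_size=2, r_threshold=0.65)
print(dm, predictable)
```

With a redundancy of 1, one gene per correlated block is enough to cover the remaining genes. Raising the redundancy requires more DM genes per block before any unmeasured gene counts as predictable, which mirrors the trade-off between coverage and accuracy described in the abstract.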
Affiliation(s)
- James Rudd
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth College, One Medical Center Drive, 7927 Rubin Building, Lebanon, NH, 03756, USA.
- René A Zelaya
  - Department of Genetics, Geisel School of Medicine at Dartmouth College; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, 10-131 SCTR, 34th & Civic Center Boulevard, Philadelphia, PA, 19104-5158, USA.
- Eugene Demidenko
  - Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth College, One Medical Center Drive, 7927 Rubin Building, Lebanon, NH, 03756, USA.
- Ellen L Goode
  - Department of Health Sciences Research, Division of Epidemiology, Mayo Clinic, 200 First St. SW, Rochester, MN, 55905, USA.
- Casey S Greene
  - Department of Genetics, Geisel School of Medicine at Dartmouth College; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, 10-131 SCTR, 34th & Civic Center Boulevard, Philadelphia, PA, 19104-5158, USA.
- Jennifer A Doherty
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth College, One Medical Center Drive, 7927 Rubin Building, Lebanon, NH, 03756, USA.
5
Jaskowiak PA, Campello RJGB, Costa IG. On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics 2014; 15 Suppl 2:S2. [PMID: 24564555 PMCID: PMC4072854 DOI: 10.1186/1471-2105-15-s2-s2] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. RESULTS AND CONCLUSIONS We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
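The claim that the choice of distance changes the outcome can be seen in a small example. The sketch below is illustrative only: the two distances shown (Euclidean and one minus Pearson correlation) are common choices assumed here, not a statement of which 15 distances the paper evaluated, and the profiles and names (`gene_b`, `gene_c`, `nearest`) are made up.

```python
import math


def euclidean(x, y):
    """Euclidean distance: sensitive to the absolute expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive to the shape of the profile only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def nearest(query, candidates, distance):
    """Name of the candidate profile closest to the query under `distance`."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


# Hypothetical expression profiles over three conditions.
query = [1.0, 2.0, 3.0]
profiles = {
    "gene_b": [10.0, 20.0, 30.0],  # same trend as the query, much larger scale
    "gene_c": [3.0, 2.0, 1.5],     # similar scale, opposite trend
}

print(nearest(query, profiles, euclidean))         # closest in magnitude
print(nearest(query, profiles, pearson_distance))  # closest in trend
```

The same query gene is grouped with a different neighbor depending on the distance, so any clustering method built on these distances would inherit that difference.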
Affiliation(s)
- Pablo A Jaskowiak
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos - SP, Brazil
- Ricardo JGB Campello
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos - SP, Brazil
- Ivan G Costa
  - Center of Informatics, Federal University of Pernambuco, Recife - PE, Brazil
  - IZKF Computational Biology Research Group, Institute for Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany
6
Armond JW, Saha K, Rana AA, Oates CJ, Jaenisch R, Nicodemi M, Mukherjee S. A stochastic model dissects cell states in biological transition processes. Sci Rep 2014; 4:3692. [PMID: 24435049 PMCID: PMC3894565 DOI: 10.1038/srep03692] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 06/28/2013] [Accepted: 12/03/2013] [Indexed: 11/09/2022]
Abstract
Many biological processes, including differentiation, reprogramming, and disease transformations, involve transitions of cells through distinct states. Direct, unbiased investigation of cell states and their transitions is challenging due to several factors, including limitations of single-cell assays. Here we present a stochastic model of cellular transitions that allows underlying single-cell information, including cell-state-specific parameters and rates governing transitions between states, to be estimated from genome-wide, population-averaged time-course data. The key novelty of our approach lies in specifying latent stochastic models at the single-cell level, and then aggregating these models to give a likelihood that links parameters at the single-cell level to observables at the population level. We apply our approach in the context of reprogramming to pluripotency. This yields new insights, including profiles of two intermediate cell states, that are supported by independent single-cell studies. Our model provides a general conceptual framework for the study of cell transitions, including epigenetic transformations.
Affiliation(s)
- Krishanu Saha
  - Department of Biomedical Engineering, University of Wisconsin-Madison, Madison, WI, USA
- Anas A Rana
  - [1] Centre for Complexity Science, University of Warwick, Coventry, UK; [2] Division of Biochemistry, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Chris J Oates
  - [1] Centre for Complexity Science, University of Warwick, Coventry, UK; [2] Division of Biochemistry, The Netherlands Cancer Institute, Amsterdam, The Netherlands; [3] Department of Statistics, University of Warwick, Coventry, UK
- Rudolf Jaenisch
  - [1] The Whitehead Institute for Biomedical Research, Massachusetts Institute of Technology, Cambridge, MA, USA; [2] Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Mario Nicodemi
  - Dip.to di Scienze Fisiche, Univ. di Napoli "Federico II", INFN Napoli, Italy
- Sach Mukherjee
  - Division of Biochemistry, The Netherlands Cancer Institute, Amsterdam, The Netherlands
7
Schulz MH, Devanny WE, Gitter A, Zhong S, Ernst J, Bar-Joseph Z. DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC SYSTEMS BIOLOGY 2012; 6:104. [PMID: 22897824 PMCID: PMC3464930 DOI: 10.1186/1752-0509-6-104] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.0] [Received: 02/28/2012] [Accepted: 07/18/2012] [Indexed: 12/28/2022]
Abstract
Background Modeling dynamic regulatory networks is a major challenge since much of the protein-DNA interaction data available is static. The Dynamic Regulatory Events Miner (DREM) uses a Hidden Markov Model-based approach to integrate this static interaction data with time series gene expression, leading to models that can determine when transcription factors (TFs) activate genes and what genes they regulate. DREM has been used successfully in diverse areas of biological research. However, several issues were not addressed by the original version. Results DREM 2.0 is a comprehensive software package for reconstructing dynamic regulatory networks that supports interactive graphical or batch mode. Version 2.0 introduces a set of new features that are unique in comparison with other software. First, we provide static interaction data for additional species. Second, DREM 2.0 now accepts continuous binding values and we added a new method to utilize TF expression levels when searching for dynamic models. Third, we added support for discriminative motif discovery, which is particularly powerful for species with limited experimental interaction data. Finally, we improved the visualization to support the new features. Combined, these changes improve the ability of DREM 2.0 to accurately recover dynamic regulatory networks and make it much easier to use for analyzing such networks in several species with varying degrees of interaction information. Conclusions DREM 2.0 provides a unique framework for constructing and visualizing dynamic regulatory networks. DREM 2.0 can be downloaded from: www.sb.cs.cmu.edu/drem.
Affiliation(s)
- Marcel H Schulz
  - Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA.
8
Li W, Zhang S, Liu CC, Zhou XJ. Identifying multi-layer gene regulatory modules from multi-dimensional genomic data. Bioinformatics 2012; 28:2458-66. [PMID: 22863767 PMCID: PMC3463121 DOI: 10.1093/bioinformatics/bts476] [Citation(s) in RCA: 92] [Impact Index Per Article: 7.1] [Indexed: 02/06/2023]
Abstract
Motivation: Eukaryotic gene expression (GE) is subjected to precisely coordinated multi-layer controls, across the levels of epigenetic, transcriptional and post-transcriptional regulations. Recently, the emerging multi-dimensional genomic dataset has provided unprecedented opportunities to study the cross-layer regulatory interplay. In these datasets, the same set of samples is profiled on several layers of genomic activities, e.g. copy number variation (CNV), DNA methylation (DM), GE and microRNA expression (ME). However, suitable analysis methods for such data are currently sparse. Results: In this article, we introduced a sparse Multi-Block Partial Least Squares (sMBPLS) regression method to identify multi-dimensional regulatory modules from this new type of data. A multi-dimensional regulatory module contains sets of regulatory factors from different layers that are likely to jointly contribute to a local ‘gene expression factory’. We demonstrated the performance of our method on the simulated data as well as on The Cancer Genome Atlas Ovarian Cancer datasets including the CNV, DM, ME and GE data measured on 230 samples. We showed that the majority of identified modules have significant functional and transcriptional enrichment, higher than that observed in modules identified using only a single type of genomic data. Our network analysis of the modules revealed that the CNV, DM and microRNA can have coupled impact on expression of important oncogenes and tumor suppressor genes. Availability and implementation: The source code implemented by MATLAB is freely available at: http://zhoulab.usc.edu/sMBPLS/. Contact: xjzhou@usc.edu
Affiliation(s)
- Wenyuan Li
  - Program in Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
9
Hashimoto T, Jaakkola T, Sherwood R, Mazzoni EO, Wichterle H, Gifford D. Lineage-based identification of cellular states and expression programs. Bioinformatics 2012; 28:i250-7. [PMID: 22689769 PMCID: PMC3371836 DOI: 10.1093/bioinformatics/bts204] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/01/2023]
Abstract
We present a method, LineageProgram, that uses the developmental lineage relationship of observed gene expression measurements to improve the learning of developmentally relevant cellular states and expression programs. We find that incorporating lineage information allows us to significantly improve both the predictive power and interpretability of expression programs that are derived from expression measurements from in vitro differentiation experiments. The lineage tree of a differentiation experiment is a tree graph whose nodes describe all of the unique expression states in the input expression measurements, and edges describe the experimental perturbations applied to cells. Our method, LineageProgram, is based on a log-linear model with parameters that reflect changes along the lineage tree. Regularization with L1-based methods controls the parameters in three distinct ways: the number of genes that change between two cellular states, the number of unique cellular states, and the number of underlying factors responsible for changes in cell state. The model is estimated with proximal operators to quickly discover a small number of key cell states and gene sets. Comparisons with existing factorization techniques, such as singular value decomposition and non-negative matrix factorization, show that our method provides higher predictive power in held-out tests while inducing sparse and biologically relevant gene sets.
Affiliation(s)
- Tatsunori Hashimoto
  - Department of Computer Science and Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
10
An ensemble approach for inferring semi-quantitative regulatory dynamics for the differentiation of mouse embryonic stem cells using prior knowledge. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2012; 736:247-60. [PMID: 22161333 DOI: 10.1007/978-1-4419-7210-1_14] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/24/2022]
Abstract
The process of differentiation of embryonic stem cells (ESCs) is currently becoming the focus of many systems biologists not only due to mechanistic interest but also since it is expected to play an increasingly important role in regenerative medicine, in particular with the advent of induced pluripotent stem cells. These ESCs give rise to the formation of the three germ layers and therefore to the formation of all tissues and organs. Here, we present a computational method for inferring regulatory interactions between the genes involved in ESC differentiation based on time-resolved microarray profiles. Fully quantitative methods are commonly unavailable on such large-scale data; on the other hand, purely qualitative methods may fail to capture some of the more detailed regulations. Our method combines the beneficial aspects of qualitative and quantitative (ODE-based) modeling approaches, searching for quantitative interaction coefficients in a discrete and qualitative state space. We further optimize on an ensemble of networks to detect essential properties and compare networks with respect to robustness. Applied to a toy model, our method is able to reconstruct the original network and outperforms an entirely discrete Boolean approach. In particular, we show that including prior knowledge leads to more accurate results. Applied to data from differentiating mouse ESCs, the method reveals new regulatory interactions; in particular, we confirm the activation of Foxh1 through Oct4, mediating Nodal signaling.
11
Zheng CH, Zhang L, Ng VTY, Shiu SCK, Huang DS. Molecular pattern discovery based on penalized matrix decomposition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1592-1603. [PMID: 21519114 DOI: 10.1109/tcbb.2011.79] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Indexed: 05/30/2023]
Abstract
A reliable and precise identification of the type of tumors is crucial to the effective treatment of cancer. With the rapid development of microarray technologies, tumor clustering based on gene expression data is becoming a powerful approach to cancer class discovery. In this paper, we apply the penalized matrix decomposition (PMD) to gene expression data to extract metasamples for clustering. The extracted metasamples capture the inherent structures of samples belonging to the same class. At the same time, the PMD factors of a sample over the metasamples can be used as its class indicator in return. Compared with the conventional methods such as hierarchical clustering (HC), self-organizing maps (SOM), affinity propagation (AP) and nonnegative matrix factorization (NMF), the proposed method can identify the samples with complex classes. Moreover, the factor of PMD can be used as an index to determine the cluster number. The proposed method provides a reasonable explanation of the inconsistent classifications made by the conventional methods. In addition, it is able to discover the modules in gene expression data of conterminous developmental stages. Experiments on two representative problems show that the proposed PMD-based method is very promising for discovering biological phenotypes.
Affiliation(s)
- Chun-Hou Zheng
  - College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui 230039, China.
12
Costa IG, Roider HG, do Rego TG, de Carvalho FDAT. Predicting gene expression in T cell differentiation from histone modifications and transcription factor binding affinities by linear mixture models. BMC Bioinformatics 2011; 12 Suppl 1:S29. [PMID: 21342559 PMCID: PMC3044284 DOI: 10.1186/1471-2105-12-s1-s29] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 12/13/2022]
Abstract
BACKGROUND The differentiation process from stem cells to fully differentiated cell types is controlled by the interplay of chromatin modifications and transcription factor activity. Histone modifications or transcription factors frequently act in a multi-functional manner, with a given DNA motif or histone modification conveying both transcriptional repression and activation depending on its location in the promoter and other regulatory signals surrounding it. RESULTS To account for the possible multi-functionality of regulatory signals, we model the observed gene expression patterns by a mixture of linear regression models. We apply the approach to identify the underlying histone modifications and transcription factors guiding gene expression of differentiated CD4+ T cells. The method improves the gene expression prediction in relation to the use of a single linear model, as often used by previous approaches. Moreover, it recovered the known role of the modifications H3K4me3 and H3K27me3 in activating cell-specific genes and of some transcription factors related to CD4+ T differentiation.
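A mixture of linear regressions lets the same predictor act with opposite signs in different groups of genes, which a single linear model cannot capture. The following is a minimal sketch with one predictor, fitted by expectation-maximization. EM, the function name `fit_mixture_of_lines`, the variance floor, and the synthetic data are assumptions for illustration and are not taken from the paper, which uses many regulatory features. The sketch has no safeguard against a component losing all of its points.

```python
import math


def fit_mixture_of_lines(xs, ys, init, n_iter=50, var_floor=1e-3):
    """EM for a mixture of one-predictor linear regressions.

    init: list of (slope, intercept) starting guesses, one per component.
    Returns a list of dicts with keys slope, intercept, var, weight.
    """
    k = len(init)
    comps = [{"slope": s, "intercept": b, "var": 1.0, "weight": 1.0 / k}
             for s, b in init]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point,
        # computed from Gaussian log-densities to avoid underflow.
        resp = []
        for x, y in zip(xs, ys):
            logs = []
            for c in comps:
                r = y - (c["slope"] * x + c["intercept"])
                logs.append(math.log(c["weight"])
                            - 0.5 * math.log(2 * math.pi * c["var"])
                            - r * r / (2 * c["var"]))
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: weighted least squares for each component.
        for j, c in enumerate(comps):
            w = [r[j] for r in resp]
            sw = sum(w)
            mx = sum(wi * x for wi, x in zip(w, xs)) / sw
            my = sum(wi * y for wi, y in zip(w, ys)) / sw
            sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
            sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
            c["slope"] = sxy / sxx
            c["intercept"] = my - c["slope"] * mx
            sse = sum(wi * (y - c["slope"] * x - c["intercept"]) ** 2
                      for wi, x, y in zip(w, xs, ys))
            c["var"] = max(sse / sw, var_floor)  # floor keeps the density finite
            c["weight"] = sw / len(xs)
    return comps


# Synthetic data: the same signal activates one gene group (y = 2x + 1)
# and represses another (y = -2x - 1).
grid = [i * 0.5 for i in range(-4, 5)]
xs = grid + grid
ys = [2 * x + 1 for x in grid] + [-2 * x - 1 for x in grid]

comps = fit_mixture_of_lines(xs, ys, init=[(1.0, 0.0), (-1.0, 0.0)])
for c in comps:
    print(round(c["slope"], 3), round(c["intercept"], 3))
```

A single regression line fitted to all of these points would have a slope near zero, hiding the fact that the predictor is strongly informative within each group.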
Affiliation(s)
- Ivan G Costa
  - Center of Informatics, Federal University of Pernambuco, Recife, Brazil.
13
Nyamundanda G, Brennan L, Gormley IC. Probabilistic principal component analysis for metabolomic data. BMC Bioinformatics 2010; 11:571. [PMID: 21092268 PMCID: PMC3006395 DOI: 10.1186/1471-2105-11-571] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Received: 08/27/2010] [Accepted: 11/23/2010] [Indexed: 01/22/2023]
Abstract
BACKGROUND Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model. RESULTS Here, probabilistic principal component analysis (PPCA), which addresses some of the limitations of PCA, is reviewed and extended. A novel extension of PPCA, called probabilistic principal component and covariates analysis (PPCCA), is introduced, which provides a flexible approach to jointly model metabolomic data and additional covariate information. The use of a mixture of PPCA models for discovering the number of inherent groups in metabolomic data is demonstrated. The jackknife technique is employed to construct confidence intervals for estimated model parameters throughout. The optimal number of principal components is determined through the use of the Bayesian Information Criterion model selection tool, which is modified to address the high dimensionality of the data. CONCLUSIONS The methods presented are illustrated through an application to metabolomic data sets. Jointly modeling metabolomic data and covariates was successfully achieved and has the potential to provide deeper insight into the underlying data structure. Examination of confidence intervals for the model parameters, such as loadings, allows for principled and clear interpretation of the underlying data structure. A software package called MetabolAnalyze, freely available through the R statistical software, has been developed to facilitate implementation of the presented methods in the metabolomics field.
Affiliation(s)
- Gift Nyamundanda
  - School of Mathematical Sciences, University College Dublin, Ireland
- Lorraine Brennan
  - School of Agriculture, Food Science and Veterinary Medicine, Conway Institute, University College Dublin, Ireland
14
Georgi B, Costa IG, Schliep A. PyMix--the python mixture package--a tool for clustering of heterogeneous biological data. BMC Bioinformatics 2010; 11:9. [PMID: 20053276 PMCID: PMC2823712 DOI: 10.1186/1471-2105-11-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 08/03/2009] [Accepted: 01/06/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Cluster analysis is an important technique for the exploratory analysis of biological data. Such data is often high-dimensional and inherently noisy, and it contains outliers. This makes clustering challenging. Mixtures are versatile and powerful statistical models which perform robustly for clustering in the presence of noise and have been successfully applied in a wide range of applications. RESULTS PyMix, the Python mixture package, implements algorithms and data structures for clustering with basic and advanced mixture models. The advanced models include context-specific independence mixtures, mixtures of dependence trees and semi-supervised learning. PyMix is licenced under the GNU General Public licence (GPL). PyMix has been successfully used for the analysis of biological sequence, complex disease and gene expression data. CONCLUSIONS PyMix is a useful tool for cluster analysis of biological data. Due to the general nature of the framework, PyMix can be applied to a wide range of applications and data sets.
Affiliation(s)
- Benjamin Georgi
  - Max Planck Institute for Molecular Genetics, Dept. of Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin.
15
A neural network-based biomarker association information extraction approach for cancer classification. J Biomed Inform 2009; 42:654-66. [DOI: 10.1016/j.jbi.2008.12.010] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Received: 08/18/2008] [Revised: 12/18/2008] [Accepted: 12/19/2008] [Indexed: 11/24/2022]