1
|
CEA: Combination-based gene set functional enrichment analysis. Sci Rep 2018; 8:13085. [PMID: 30166636 PMCID: PMC6117355 DOI: 10.1038/s41598-018-31396-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/12/2018] [Accepted: 08/10/2018] [Indexed: 02/08/2023]
Abstract
Functional enrichment analysis is a fundamental and challenging task in bioinformatics. Most of the current enrichment analysis approaches individually evaluate functional terms and often output a list of enriched terms with high similarity and redundancy, which makes it difficult for downstream studies to extract the underlying biological interpretation. In this paper, we proposed a novel framework to assess the performance of combination-based enrichment analysis. Using this framework, we formulated the enrichment analysis as a multi-objective combinatorial optimization problem and developed the CEA (Combination-based Enrichment Analysis) method. CEA provides the whole landscape of term combinations; therefore, it is a good benchmark for evaluating the current state-of-the-art combination-based functional enrichment methods in a comprehensive manner. We tested the effectiveness of CEA on four published microarray datasets. Enriched functional terms identified by CEA not only involve crucial biological processes of related diseases, but also have much less redundancy and can serve as a preferable representation for the enriched terms found by traditional single-term-based methods. CEA has been implemented in the R package CopTea and is available at http://github.com/wulingyun/CopTea/.
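For context, the single-term approach that this abstract contrasts CEA with is commonly a hypergeometric over-representation test applied to each term on its own. The sketch below illustrates that baseline only, not CEA itself; the function name and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_term, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value for one functional term.

    n_universe: number of genes in the background
    n_term:     number of background genes annotated to the term
    n_selected: number of genes in the list of interest
    n_overlap:  number of selected genes annotated to the term
    """
    upper = min(n_term, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probability of seeing n_overlap or more annotated genes.
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: two terms annotating nearly the same genes.
p_term_a = hypergeom_enrichment_pvalue(1000, 40, 50, 10)
p_term_b = hypergeom_enrichment_pvalue(1000, 42, 50, 10)
print(p_term_a, p_term_b)
```

Because each term is scored independently, two terms that share most of their genes both come out significant, which is the redundancy the abstract describes.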
|
2
|
Liang K, Du C, You H, Nettleton D. A hidden Markov tree model for testing multiple hypotheses corresponding to Gene Ontology gene sets. BMC Bioinformatics 2018; 19:107. [PMID: 29587646 PMCID: PMC5869792 DOI: 10.1186/s12859-018-2106-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/30/2017] [Accepted: 03/05/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Testing predefined gene categories has become a common practice for scientists analyzing high throughput transcriptome data. A systematic way of testing gene categories leads to testing hundreds of null hypotheses that correspond to nodes in a directed acyclic graph. The relationships among gene categories induce logical restrictions among the corresponding null hypotheses. An existing fully Bayesian method is powerful but computationally demanding. RESULTS: We develop a computationally efficient method based on a hidden Markov tree model (HMTM). Our method is several orders of magnitude faster than the existing fully Bayesian method. Through simulation and an expression quantitative trait loci study, we show that the HMTM method provides more powerful results than other existing methods that honor the logical restrictions. CONCLUSIONS: The HMTM method provides an individual estimate of posterior probability of being differentially expressed for each gene set, which can be useful for result interpretation. The R package can be found on https://github.com/k22liang/HMTGO.
Affiliation(s)
- Kun Liang
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, N2L 3G1, Canada
- Chuanlong Du
  - Department of Statistics, Iowa State University, Ames, 50011, USA
- Hankun You
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, N2L 3G1, Canada
- Dan Nettleton
  - Department of Statistics, Iowa State University, Ames, 50011, USA
|
3
|
Frost HR, Amos CI. Gene set selection via LASSO penalized regression (SLPR). Nucleic Acids Res 2017; 45:e114. [PMID: 28472344 PMCID: PMC5499546 DOI: 10.1093/nar/gkx291] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 09/26/2016] [Accepted: 04/12/2017] [Indexed: 01/23/2023]
Abstract
Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets.
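The mapping this abstract describes, from multiset gene set testing to penalized linear regression, can be sketched as follows. This is a minimal illustration under stated assumptions, not the SLPR implementation: it assumes the response is a continuous gene-level statistic and the predictors are 0/1 gene set membership indicators, and it uses a plain coordinate-descent LASSO with a hand-picked penalty. All names and the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            new = soft_threshold(rho / n, lam) / (col_sq[j] / n)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical toy data: six genes, three gene sets, sets A and B overlap.
gene_sets = {"A": {0, 1, 2}, "B": {1, 2, 3}, "C": {4, 5}}
names = sorted(gene_sets)
X = [[1.0 if g in gene_sets[s] else 0.0 for s in names] for g in range(6)]
y = [2.0, 2.0, 2.0, 0.0, 0.0, 0.0]  # gene-level activity driven by set A only

beta = lasso_coordinate_descent(X, y, lam=0.1)
selected = [s for s, b in zip(names, beta) if b != 0.0]
print(dict(zip(names, beta)), selected)
```

In this toy case the L1 penalty keeps set A and drops the overlapping set B, which is the parsimony the multiset formulation aims for. In practice the penalty would be chosen by a data-driven procedure rather than fixed by hand.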
Affiliation(s)
- H Robert Frost
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Christopher I Amos
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
|
4
|
Morris JS, Baladandayuthapani V. Statistical Contributions to Bioinformatics: Design, Modeling, Structure Learning, and Integration. Stat Model 2017; 17:245-289. [PMID: 29129969 PMCID: PMC5679480 DOI: 10.1177/1471082x17698255] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 12/24/2022]
Abstract
The advent of high-throughput multi-platform genomics technologies providing whole-genome molecular summaries of biological samples has revolutionized biomedical research. These technologies yield highly structured big data, whose analysis poses significant quantitative challenges. The field of Bioinformatics has emerged to deal with these challenges, and is comprised of many quantitative and biological scientists working together to effectively process these data and extract the treasure trove of information they contain. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts. In this article, we attempt to summarize some of the key contributions of statisticians to bioinformatics, focusing on four areas: (1) experimental design and reproducibility, (2) preprocessing and feature extraction, (3) unified modeling, and (4) structure learning and integration. In each of these areas, we highlight some key contributions and try to elucidate the key statistical principles underlying these methods and approaches. Our goals are to demonstrate major ways in which statisticians have contributed to bioinformatics, encourage statisticians to get involved early in methods development as new technologies emerge, and to stimulate future methodological work based on the statistical principles elucidated in this article and utilizing all available information to uncover new biological insights.
Affiliation(s)
- Jeffrey S Morris
  - Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
|
5
|
Paul PK, Rabaglia ME, Wang CY, Stapleton DS, Leng N, Kendziorski C, Lewis PW, Keller MP, Attie AD. Histone chaperone ASF1B promotes human β-cell proliferation via recruitment of histone H3.3. Cell Cycle 2016; 15:3191-3202. [PMID: 27753532 DOI: 10.1080/15384101.2016.1241914] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Indexed: 12/12/2022]
Abstract
Anti-silencing function 1 (ASF1) is a histone H3-H4 chaperone involved in DNA replication and repair, and transcriptional regulation. Here, we identify ASF1B, the mammalian paralog to ASF1, as a proliferation-inducing histone chaperone in human β-cells. Overexpression of ASF1B led to distinct transcriptional signatures consistent with increased cellular proliferation and reduced cellular death. Using multiple methods of monitoring proliferation and mitotic progression, we show that overexpression of ASF1B is sufficient to induce human β-cell proliferation. Co-expression of histone H3.3 further augmented β-cell proliferation, whereas suppression of endogenous H3.3 attenuated the stimulatory effect of ASF1B. Using the histone binding-deficient mutant of ASF1B (V94R), we show that histone binding to ASF1B is required for the induction of β-cell proliferation. In contrast to H3.3, overexpression of histone H3 variants H3.1 and H3.2 did not have an impact on ASF1B-mediated induction of proliferation. Our findings reveal a novel role of ASF1B in human β-cell replication and show that ASF1B and histone H3.3A synergistically stimulate human β-cell proliferation.
Affiliation(s)
- Pradyut K Paul
  - Department of Biochemistry, University of Wisconsin, Madison, WI, USA
- Mary E Rabaglia
  - Department of Biochemistry, University of Wisconsin, Madison, WI, USA
- Chen-Yu Wang
  - Department of Biochemistry, University of Wisconsin, Madison, WI, USA
- Donald S Stapleton
  - Department of Biochemistry, University of Wisconsin, Madison, WI, USA
- Ning Leng
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Christina Kendziorski
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- Peter W Lewis
  - Department of Biomolecular Chemistry, University of Wisconsin, Madison, WI, USA
- Mark P Keller
  - Department of Biochemistry, University of Wisconsin, Madison, WI, USA
- Alan D Attie
  - Department of Biochemistry, University of Wisconsin, Madison, WI, USA
|
6
|
Xiong L, Kuan PF, Tian J, Keles S, Wang S. Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data. Cancer Inform 2015; 13:123-31. [PMID: 26609213 PMCID: PMC4648611 DOI: 10.4137/cin.s16353] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/18/2014] [Revised: 03/16/2015] [Accepted: 03/20/2015] [Indexed: 12/29/2022]
Abstract
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. Particularly, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize boosting-type method to handle the high dimensionality as well as model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies.
Affiliation(s)
- Lie Xiong
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Pei-Fen Kuan
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
- Jianan Tian
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Sunduz Keles
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- Sijian Wang
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
|
7
|
Pan W, Kwak IY, Wei P. A Powerful Pathway-Based Adaptive Test for Genetic Association with Common or Rare Variants. Am J Hum Genet 2015; 97:86-98. [PMID: 26119817 DOI: 10.1016/j.ajhg.2015.05.018] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Received: 02/14/2015] [Accepted: 05/21/2015] [Indexed: 12/11/2022]
Abstract
In spite of the success of genome-wide association studies (GWASs), only a small proportion of heritability for each complex trait has been explained by identified genetic variants, mainly SNPs. Likely reasons include genetic heterogeneity (i.e., multiple causal genetic variants) and small effect sizes of causal variants, for which pathway analysis has been proposed as a promising alternative to the standard single-SNP-based analysis. A pathway contains a set of functionally related genes, each of which includes multiple SNPs. Here we propose a pathway-based test that is adaptive at both the gene and SNP levels, thus maintaining high power across a wide range of situations with varying numbers of the genes and SNPs associated with a trait. The proposed method is applicable to both common variants and rare variants and can incorporate biological knowledge on SNPs and genes to boost statistical power. We use extensively simulated data and a WTCCC GWAS dataset to compare our proposal with several existing pathway-based and SNP-set-based tests, demonstrating its promising performance and its potential use in practice.
|
8
|
Abstract
An important data analysis task in statistical genomics involves the integration of genome-wide gene-level measurements with preexisting data on the same genes. A wide variety of statistical methodologies and computational tools have been developed for this general task. We emphasize one particular distinction among methodologies, namely whether they process gene sets one at a time (uniset) or simultaneously via some multiset technique. Owing to the complexity of collections of gene sets, the multiset approach offers some advantages, as it naturally accommodates set-size variations and among-set overlaps. However, this approach presents both computational and inferential challenges. After reviewing some statistical issues that arise in uniset analysis, we examine two model-based multiset methods for gene list data.
Affiliation(s)
- Michael A Newton
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin 53706
- Zhishi Wang
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706
|
9
|
Wang Z, He Q, Larget B, Newton MA. A multi-functional analyzer uses parameter constraints to improve the efficiency of model-based gene-set analysis. Ann Appl Stat 2015. [DOI: 10.1214/14-aoas777] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
10
|
Kristensen VN, Lingjærde OC, Russnes HG, Vollan HKM, Frigessi A, Børresen-Dale AL. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer 2014; 14:299-313. [PMID: 24759209 DOI: 10.1038/nrc3721] [Citation(s) in RCA: 235] [Impact Index Per Article: 23.5] [Indexed: 12/15/2022]
Abstract
Combined analyses of molecular data, such as DNA copy-number alteration, mRNA and protein expression, point to biological functions and molecular pathways being deregulated in multiple cancers. Genomic, metabolomic and clinical data from various solid cancers and model systems are emerging and can be used to identify novel patient subgroups for tailored therapy and monitoring. The integrative genomics methodologies that are used to interpret these data require expertise in different disciplines, such as biology, medicine, mathematics, statistics and bioinformatics, and they can seem daunting. The objectives, methods and computational tools of integrative genomics that are available to date are reviewed here, as is their implementation in cancer research.
Affiliation(s)
- Vessela N Kristensen
  - Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
  - K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway
  - Department of Clinical Molecular Oncology, Division of Medicine, Akershus University Hospital, 1478 Ahus, Norway
- Ole Christian Lingjærde
  - K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway
  - Division for Biomedical Informatics, Department of Computer Science, University of Oslo, 0316 Oslo, Norway
- Hege G Russnes
  - Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
  - K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway
  - Department of Pathology, Oslo University Hospital, 0450 Oslo, Norway
- Hans Kristian M Vollan
  - Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
  - K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway
  - Department of Oncology, Division of Cancer, Surgery and Transplantation, Oslo University Hospital, 0450 Oslo, Norway
- Arnoldo Frigessi
  - Statistics for Innovation, Norwegian Computing Center, 0314 Oslo, Norway
  - Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, PO Box 1122 Blindern, 0317 Oslo, Norway
- Anne-Lise Børresen-Dale
  - Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
  - K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway
|
11
|
Hao L, He Q, Wang Z, Craven M, Newton MA, Ahlquist P. Limited agreement of independent RNAi screens for virus-required host genes owes more to false-negative than false-positive factors. PLoS Comput Biol 2013; 9:e1003235. [PMID: 24068911 PMCID: PMC3777922 DOI: 10.1371/journal.pcbi.1003235] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Received: 01/07/2013] [Accepted: 08/07/2013] [Indexed: 11/19/2022]
Abstract
Systematic, genome-wide RNA interference (RNAi) analysis is a powerful approach to identify gene functions that support or modulate selected biological processes. An emerging challenge shared with some other genome-wide approaches is that independent RNAi studies often show limited agreement in their lists of implicated genes. To better understand this, we analyzed four genome-wide RNAi studies that identified host genes involved in influenza virus replication. These studies collectively identified and validated the roles of 614 cell genes, but pair-wise overlap among the four gene lists was only 3% to 15% (average 6.7%). However, a number of functional categories were overrepresented in multiple studies. The pair-wise overlap of these enriched-category lists was high, ∼19%, implying more agreement among studies than apparent at the gene level. Probing this further, we found that the gene lists implicated by independent studies were highly connected in interacting networks by independent functional measures such as protein-protein interactions, at rates significantly higher than predicted by chance. We also developed a general, model-based approach to gauge the effects of false-positive and false-negative factors and to estimate, from a limited number of studies, the total number of genes involved in a process. For influenza virus replication, this novel statistical approach estimates the total number of cell genes involved to be ∼2,800. This and multiple other aspects of our experimental and computational results imply that, when following good quality control practices, the low overlap between studies is primarily due to false negatives rather than false-positive gene identifications. These results and methods have implications for and applications to multiple forms of genome-wide analysis.
Genome-wide RNA interference assays of gene functions offer the potential for systematic, global analysis of biological processes. A pressing challenge is to develop meta-analysis methods that effectively combine information from multiple studies. One puzzle is that implicated gene lists from independent studies of the same process often show relatively low overlap. This disagreement might arise from false-positive factors, such as imperfect gene targeting (off-target effects), or from false negatives if separate studies access different components of large, complex systems. We present new methods to examine the relations between individual genome-wide RNAi studies, using studies of host genes in influenza virus replication as a test case. We find that cross-study agreement is greater than suggested by overlap of reported gene lists. This better agreement is evidenced by the strong relation of independent gene lists in functional pathways and protein interaction networks, and by a statistical model that relates multi-study, gene-level findings to factors driving correct, false-negative, and false-positive gene identification. Our analysis of multiple genome-wide studies predicts that there are many undetected host genes important for influenza virus infection, and that false negatives are the major concerns for genome-wide studies.
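The pair-wise overlap figures quoted above depend on how overlap is defined, which the abstract does not state. As a rough sketch, the snippet below assumes a Jaccard-style overlap (shared genes divided by the union of the two lists); the study names and gene identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(gene_lists):
    """Jaccard overlap for every pair of named gene lists."""
    result = {}
    for (name_a, a), (name_b, b) in combinations(sorted(gene_lists.items()), 2):
        union = a | b
        result[(name_a, name_b)] = len(a & b) / len(union) if union else 0.0
    return result


# Hypothetical hit lists from three independent screens.
screens = {
    "screen1": {"G1", "G2", "G3", "G4"},
    "screen2": {"G3", "G4", "G5", "G6"},
    "screen3": {"G6", "G7", "G8", "G9"},
}
overlaps = pairwise_overlap(screens)
average = sum(overlaps.values()) / len(overlaps)
print(overlaps, average)
```

The same function applied to lists of enriched functional categories instead of genes would give the category-level comparison the abstract reports.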
Affiliation(s)
- Linhui Hao
  - Institute of Molecular Virology, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
  - Howard Hughes Medical Institute, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Qiuling He
  - Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Zhishi Wang
  - Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Mark Craven
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
  - Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Michael A. Newton
  - Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Paul Ahlquist
  - Institute of Molecular Virology, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
  - Howard Hughes Medical Institute, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
  - Morgridge Institute for Research, Madison, Wisconsin, United States of America
|
12
|
Gu Z, Liu J, Cao K, Zhang J, Wang J. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology 2012; 6:56. [PMID: 22672776 PMCID: PMC3443660 DOI: 10.1186/1752-0509-6-56] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Received: 01/31/2012] [Accepted: 05/24/2012] [Indexed: 12/18/2022]
Abstract
Background: Biological pathways are important for understanding biological mechanisms. Thus, finding important pathways that underlie biological problems helps researchers to focus on the most relevant sets of genes. Pathways resemble networks with complicated structures, but most of the existing pathway enrichment tools ignore topological information embedded within pathways, which limits their applicability. Results: A systematic and extensible pathway enrichment method in which nodes are weighted by network centrality was proposed. We demonstrate how choice of pathway structure and centrality measurement, as well as the presence of key genes, affects pathway significance. We emphasize two improvements of our method over current methods. First, allowing for the diversity of genes’ characters and the difficulty of covering gene importance from all aspects, we set centrality as an optional parameter in the model. Second, nodes rather than genes form the basic unit of pathways, such that one node can be composed of several genes and one gene may reside in different nodes. By comparing our methodology to the original enrichment method using both simulation data and real-world data, we demonstrate the efficacy of our method in finding new pathways from biological perspective. Conclusions: Our method can benefit the systematic analysis of biological pathways and help to extract more meaningful information from gene expression data. The algorithm has been implemented as an R package CePa, and also a web-based version of CePa is provided.
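The idea of weighting pathway nodes by network centrality can be sketched as below. This is a simplified illustration and not the CePa algorithm: it assumes degree centrality as the weight, a pathway score equal to the summed weights of differentially expressed nodes, and a permutation null that redraws the differentially expressed set from a background universe. All function names and the toy pathway are hypothetical.

```python
import random


def degree_centrality(edges):
    """Count the number of edges touching each node."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


def weighted_score(weights, de_nodes):
    """Sum the centrality weights of pathway nodes that are differentially expressed."""
    return sum(w for node, w in weights.items() if node in de_nodes)


def permutation_pvalue(weights, de_nodes, universe, n_perm=1000, seed=0):
    """Share of random DE sets scoring at least as high as the observed one."""
    rng = random.Random(seed)
    observed = weighted_score(weights, de_nodes)
    pool = sorted(universe)
    hits = 0
    for _ in range(n_perm):
        null_de = set(rng.sample(pool, len(de_nodes)))
        if weighted_score(weights, null_de) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical star-shaped pathway: hub H linked to A, B and C.
edges = [("H", "A"), ("H", "B"), ("H", "C")]
weights = degree_centrality(edges)
universe = set(weights) | {f"G{i}" for i in range(16)}

p_hub = permutation_pvalue(weights, {"H", "G0"}, universe)
p_leaf = permutation_pvalue(weights, {"A", "G0"}, universe)
print(weights, p_hub, p_leaf)
```

With equal numbers of differentially expressed genes, a hit on the hub yields a higher score than a hit on a leaf, which is how a key gene can dominate pathway significance in this kind of scheme.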
Affiliation(s)
- Zuguang Gu
  - The State Key Laboratory of Pharmaceutical Biotechnology and Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Science, Nanjing University, Nanjing, 210093, China
|