1
Ouyang Y, Liu J, Tong T, Xu W. A rank-based high-dimensional test for equality of mean vectors. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
2
Dong P, Lin L. Neyman’s truncation test for two-sample means under high dimensional setting. Braz J Probab Stat 2022. [DOI: 10.1214/21-bjps519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Ping Dong
- School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai, China
- Lu Lin
- School of Statistics, Shandong Technology and Business University, Yantai, China
3
Li R, Xu K, Zhou Y, Zhu L. Testing the effects of high-dimensional covariates via aggregating cumulative covariances. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2044334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Runze Li
- The Pennsylvania State University
4
Peng L, Qu L, Nettleton D. Variable importance assessments and backward variable selection for multi-sample problems. J Multivariate Anal 2021. [DOI: 10.1016/j.jmva.2021.104807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
5
Cao X, Pounds S. Gene-set distance analysis (GSDA): a powerful tool for gene-set association analysis. BMC Bioinformatics 2021; 22:207. [PMID: 33882829 PMCID: PMC8059024 DOI: 10.1186/s12859-021-04110-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2020] [Accepted: 03/30/2021] [Indexed: 11/23/2022]
Abstract
Background Identifying sets of related genes (gene sets) that are empirically associated with a treatment or phenotype often yields valuable biological insights. Several methods effectively identify gene sets in which individual genes have simple monotonic relationships with categorical, quantitative, or censored event-time variables. Some distance-based methods, such as distance correlations, may detect complex non-monotone associations of a gene-set with a quantitative variable that elude other methods. However, the distance correlations have yet to be generalized to associate gene-sets with categorical and censored event-time endpoints. Also, there is a need to determine which genes empirically drive the significance of an association of a gene set with an endpoint. Results We develop gene-set distance analysis (GSDA) by generalizing distance correlations to evaluate the association of a gene set with categorical and censored event-time variables. We also develop a backward elimination procedure to identify a subset of genes that empirically drive significant associations. In simulation studies, GSDA more effectively identified complex non-monotone gene-set associations than did six other published methods. In the analysis of a pediatric acute myeloid leukemia (AML) data set, GSDA was the only method to discover that event-free survival (EFS) was associated with the 56-gene AML pathway gene-set, narrow that result down to 5 genes, and confirm the association of those 5 genes with EFS in a separate validation cohort. These results indicate that GSDA effectively identifies and characterizes complex non-monotonic gene-set associations that are missed by other methods. Conclusion GSDA is a powerful and flexible method to detect gene-set association with categorical, quantitative, or censored event-time variables, especially to detect complex non-monotonic gene-set associations. Available at https://CRAN.R-project.org/package=GSDA. 
Supplementary information The online version contains supplementary material available at 10.1186/s12859-021-04110-x.
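As a rough illustration of why distance-based measures can pick up non-monotone associations, the sketch below computes the ordinary sample distance correlation between a gene-set matrix (or a single variable) and a quantitative endpoint, with a permutation p-value. It is not the GSDA package's API and does not cover the categorical or censored generalizations described in the abstract; the function names, the toy data, and the choice of a permutation test are assumptions made for illustration.

```python
import numpy as np


def centered_distance_matrix(a):
    """Double-centered matrix of pairwise Euclidean distances between rows."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # treat a vector as n observations of one variable
    d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x and y (rows are subjects)."""
    A = centered_distance_matrix(x)
    B = centered_distance_matrix(y)
    dcov2 = (A * B).mean()                             # squared distance covariance
    scale = np.sqrt((A * A).mean() * (B * B).mean())   # product of distance std. devs.
    return 0.0 if scale == 0 else float(np.sqrt(max(dcov2, 0.0) / scale))


def dcor_permutation_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value: shuffle the endpoint to break any association."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = distance_correlation(x, y)
    exceed = sum(
        distance_correlation(x, y[rng.permutation(len(y))]) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)


# Toy example (hypothetical data): a symmetric, non-monotone relationship.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2
pearson = np.corrcoef(x, y)[0, 1]   # linear correlation misses the pattern
dcor = distance_correlation(x, y)   # distance correlation does not
```

In this toy case the Pearson correlation is essentially zero by symmetry, while the distance correlation is clearly positive, which is the kind of association the abstract says other methods can miss.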
Affiliation(s)
- Xueyuan Cao
- Department of Acute and Tertiary Care, University of Tennessee Health Science Center, Memphis, 38163, USA
- Stan Pounds
- Department of Biostatistics, St Jude Children's Research Hospital, Memphis, 38105, USA.
6
Abstract
BACKGROUND With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method's performance. RESULTS Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. CONCLUSIONS Using data simulated from a theoretical model can substantially impact the results of a study. We developed more realistic simulation techniques for RNA-seq data. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
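One way to add a known signal to real counts, rather than simulating from a parametric model, is binomial thinning: randomly discarding reads from one group lowers that group's expected count by a chosen factor. The sketch below is a minimal illustration under that assumption; the function name, arguments, and toy data are hypothetical and are not the seqgendiff package's interface.

```python
import numpy as np


def add_signal_by_thinning(counts, group, log2_fc, seed=0):
    """Thin a genes-by-samples count matrix so group 1 vs group 0 shows log2_fc.

    counts  : non-negative integer array, shape (genes, samples)
    group   : boolean array, shape (samples,), True for group 1
    log2_fc : target log2 fold change per gene (0 leaves a gene untouched)
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    group = np.asarray(group, dtype=bool)
    log2_fc = np.asarray(log2_fc, dtype=float)

    keep = np.ones(counts.shape)  # probability of keeping each read
    up = log2_fc > 0
    down = log2_fc < 0
    # Positive effect: thin group 0 so that group 1 is relatively higher.
    keep[np.ix_(up, ~group)] = 2.0 ** (-log2_fc[up])[:, None]
    # Negative effect: thin group 1 instead.
    keep[np.ix_(down, group)] = 2.0 ** (log2_fc[down])[:, None]
    return rng.binomial(counts, keep)


# Toy example (hypothetical data): 3 genes, 2000 samples, constant counts.
counts = np.full((3, 2000), 100)
group = np.arange(2000) % 2 == 1
thinned = add_signal_by_thinning(counts, group, log2_fc=[1.0, 0.0, -1.0])
```

Because reads are only ever removed, the thinned data keep the dependence structure and other quirks of the source dataset, while the genes with nonzero `log2_fc` carry a known effect that can serve as ground truth.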
Affiliation(s)
- David Gerard
- Department of Mathematics and Statistics, American University, Massachusetts Ave NW, Washington, DC, 20016, USA.
7
Stress response to CO2 deprivation by Arabidopsis thaliana in plant cultures. PLoS One 2019; 14:e0212462. [PMID: 30865661 PMCID: PMC6415875 DOI: 10.1371/journal.pone.0212462] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/28/2018] [Accepted: 02/02/2019] [Indexed: 12/17/2022]
Abstract
After being the standard plant propagation protocol for decades, cultures of Arabidopsis thaliana sealed with Parafilm remain common today out of practicality, habit, or necessity (as in co-cultures with microorganisms). Despite concerns over the aeration of these cultures, no investigation has explored CO2 transport inside them or its effect on the plants. As a result, it has been impossible to assess whether the Parafilm seals used today, or in thousands of older papers in the literature, constitute a treatment, and whether this treatment could affect the study of other treatments. For the first time, we report the CO2 concentrations in Parafilm-sealed cultures of A. thaliana with a 1-minute temporal resolution, and a transcriptome comparison with aerated cultures. The data show significant CO2 deprivation of the plants, a drastic suppression of photosynthesis, respiration, starch accumulation, and chlorophyll biosynthesis, and an increased accumulation of reactive oxygen species. Most importantly, CO2 deprivation occurs as soon as the cotyledons emerge. Gene expression analysis indicates a significant alteration of 35% of the pathways when compared to aerated cultures, especially in stress response and secondary metabolism processes. On the other hand, the observed increase in the production of glucosinolates and flavonoids suggests intriguing possibilities for CO2 deprivation as an organic biofortification treatment in high-value crops.
8
POST: A framework for set-based association analysis in high-dimensional data. Methods 2018; 145:76-81. [PMID: 29777750 DOI: 10.1016/j.ymeth.2018.05.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/09/2018] [Revised: 05/11/2018] [Accepted: 05/13/2018] [Indexed: 01/08/2023]
Abstract
Evaluating the differential expression of a set of genes belonging to a common biological process or ontology has proven to be a very useful tool for biological discovery. However, existing gene-set association methods are limited to applications that evaluate differential expression across k⩾2 treatment groups or biological categories. This limitation precludes researchers from most effectively evaluating the association with other phenotypes that may be more clinically meaningful, such as quantitative variables or censored survival time variables. Projection onto the Orthogonal Space Testing (POST) is proposed as a general procedure that can robustly evaluate the association of a gene-set with several different types of phenotypic data (categorical, ordinal, continuous, or censored). For each gene-set, POST transforms the gene profiles into a set of eigenvectors and then uses statistical modeling to compute a set of z-statistics that measure the association of each eigenvector with the phenotype. The overall gene-set statistic is the sum of squared z-statistics weighted by the corresponding eigenvalues. Finally, bootstrapping is used to compute a p-value. POST may evaluate associations with or without adjustment for covariates. In simulation studies, it is shown that the performance of POST in evaluating the association with a categorical phenotype is similar to or exceeds that of existing methods. In evaluating the association of 875 biological processes with the time to relapse of pediatric acute myeloid leukemia, POST identified the well-known oncogenic WNT signaling pathway as its top hit. These results indicate that POST can be a very useful tool for evaluating the association of a gene-set with a variety of different phenotypes. We have developed an R package named POST which is freely available in Bioconductor.
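The steps listed in the abstract (project onto eigenvectors, compute a z-statistic per eigenvector, sum the squared z-statistics weighted by eigenvalues, then resample for a p-value) can be sketched roughly as follows for a continuous phenotype. This is an illustrative simplification, not the Bioconductor POST package: the z-statistics here are correlation t-statistics, a permutation null stands in for the bootstrap described in the abstract, and the `var_explained` cutoff and all function names are assumptions.

```python
import numpy as np


def post_like_statistic(X, y, var_explained=0.8):
    """Eigenvalue-weighted sum of squared z-statistics for one gene set.

    X : samples-by-genes expression matrix for the gene set
    y : continuous phenotype, one value per sample
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)

    # Eigen-decomposition of the gene-gene covariance via SVD of centered data.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s ** 2 / (n - 1)
    frac = np.cumsum(eigvals) / eigvals.sum()
    keep = min(int(np.searchsorted(frac, var_explained)) + 1, len(eigvals))

    # Project samples onto the leading eigenvectors.
    scores = Xc @ Vt[:keep].T

    # One z-like statistic per eigenvector: the t-statistic of its correlation with y.
    yc = y - y.mean()
    r = scores.T @ yc / (np.linalg.norm(scores, axis=0) * np.linalg.norm(yc))
    r = np.clip(r, -0.999999, 0.999999)
    z = r * np.sqrt((n - 2) / (1.0 - r ** 2))

    return float(np.sum(eigvals[:keep] * z ** 2))


def post_like_pvalue(X, y, n_perm=200, seed=0):
    """Permutation p-value (a stand-in for the bootstrap used by POST)."""
    rng = np.random.default_rng(seed)
    observed = post_like_statistic(X, y)
    null = [post_like_statistic(X, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)
```

Weighting by eigenvalues means that directions carrying most of the gene set's variance dominate the statistic, so a phenotype associated only with a low-variance direction contributes little.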
9
Liang K, Du C, You H, Nettleton D. A hidden Markov tree model for testing multiple hypotheses corresponding to Gene Ontology gene sets. BMC Bioinformatics 2018; 19:107. [PMID: 29587646 PMCID: PMC5869792 DOI: 10.1186/s12859-018-2106-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/30/2017] [Accepted: 03/05/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND Testing predefined gene categories has become a common practice for scientists analyzing high throughput transcriptome data. A systematic way of testing gene categories leads to testing hundreds of null hypotheses that correspond to nodes in a directed acyclic graph. The relationships among gene categories induce logical restrictions among the corresponding null hypotheses. An existing fully Bayesian method is powerful but computationally demanding. RESULTS We develop a computationally efficient method based on a hidden Markov tree model (HMTM). Our method is several orders of magnitude faster than the existing fully Bayesian method. Through simulation and an expression quantitative trait loci study, we show that the HMTM method provides more powerful results than other existing methods that honor the logical restrictions. CONCLUSIONS The HMTM method provides an individual estimate of posterior probability of being differentially expressed for each gene set, which can be useful for result interpretation. The R package can be found at https://github.com/k22liang/HMTGO.
Affiliation(s)
- Kun Liang
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, N2L 3G1, Canada.
- Chuanlong Du
- Department of Statistics, Iowa State University, Ames, 50011, USA
- Hankun You
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, N2L 3G1, Canada
- Dan Nettleton
- Department of Statistics, Iowa State University, Ames, 50011, USA
10
Xie XP, Gan B, Yang W, Wang HQ. ctPath: Demixing pathway crosstalk effect from transcriptomics data for differential pathway identification. J Biomed Inform 2017; 73:104-114. [PMID: 28756161 DOI: 10.1016/j.jbi.2017.07.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/05/2017] [Revised: 07/25/2017] [Accepted: 07/25/2017] [Indexed: 12/17/2022]
Abstract
Identifying differentially expressed pathways (DEPs) plays an important role in understanding tumor etiology and promoting clinical treatment of cancer and other diseases. By assuming gene expression to be a sparse non-negative linear combination of hidden pathway signals, we propose a pathway crosstalk-based transcriptomics data analysis method (ctPath) for identifying differentially expressed pathways. Biologically, pathways of different functions work in concert at the systematic level. The proposed method interrogates the crosstalk between pathways and discovers hidden pathway signals by mapping high-dimensional transcriptomics data into a low-dimensional pathway space. The resulting pathway signals reflect the activity level of pathways after removing the pathway crosstalk effect and allow a robust identification of DEPs from inherently complex and noisy transcriptomics data. ctPath can also correct incomplete and inaccurate pathway annotations, which frequently occur in public repositories. Experimental results on both simulation data and real-world cancer data demonstrate the superior performance of ctPath over other popular approaches. R code for ctPath is available for non-commercial use at http://micblab.iim.ac.cn/Download/.
Affiliation(s)
- Xin-Ping Xie
- School of Mathematics and Physics, Anhui Jianzhu University, Hefei, Anhui, China
- Bin Gan
- Biological Molecular Information System Lab., Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China
- Wulin Yang
- Center for Medical Physics and Technology, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China; Cancer Hospital, CAS, Hefei, Anhui, China
- Hong-Qiang Wang
- Biological Molecular Information System Lab., Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China; Center for Medical Physics and Technology, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China; Cancer Hospital, CAS, Hefei, Anhui, China.
11
Abstract
Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as a common approach, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, the sample covariance may not be a precise estimate if the sample size is very limited, which is usually the case for data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed from the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication. We implemented the T2-statistic in an R package, T2GA, which is available at https://github.com/roqe/T2GA.
Pathway analysis is a common approach to quickly assess the pathways being regulated in the experiments. There are numerous statistics to perform pathway analysis; most of them assume that the genes or proteins are independent of each other for statistical ease. This assumption, however, is unrealistic for real biological systems and may cause false positives in practice. A standard way to address this issue is to measure the associations among genes or proteins. Unfortunately, the estimation of associations requires sufficient sample size, which is usually not available for proteomic data produced by mass spectrometry. In this study, we propose a T2-statistic, which estimates the associations among gene products, to perform pathway analysis for quantitative proteomic data. Instead of calculating the associations directly from data, we use the confidence scores retrieved from protein-protein interaction databases. We also design an integrating procedure to retain pathways of sufficient evidence as a regulated pathway group. We compare the proposed T2-statistic to other popular statistics using five published experimental datasets, and the T2-statistic yields more accurate descriptions in agreement with the discussion of the original papers.
12
Chang J, Zheng C, Zhou WX, Zhou W. Simulation-based hypothesis testing of high dimensional means under covariance heterogeneity. Biometrics 2017; 73:1300-1310. [PMID: 28369742 DOI: 10.1111/biom.12695] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 02/01/2016] [Revised: 02/01/2017] [Accepted: 02/01/2017] [Indexed: 11/29/2022]
Abstract
In this article, we study the problem of testing the mean vectors of high dimensional data in both one-sample and two-sample cases. The proposed testing procedures employ maximum-type statistics and parametric bootstrap techniques to compute the critical values. Unlike existing tests that rely heavily on structural conditions on the unknown covariance matrices, the proposed tests allow general covariance structures of the data and therefore enjoy a wide scope of applicability in practice. To enhance the power of the tests against sparse alternatives, we further propose two-step procedures with a preliminary feature screening step. Theoretical properties of the proposed tests are investigated. Through extensive numerical experiments on synthetic data sets and a human acute lymphoblastic leukemia gene expression data set, we illustrate the performance of the new tests and how they may assist in detecting disease-associated gene sets. The proposed methods have been implemented in the R package HDtest and are available on CRAN.
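For the one-sample case, the combination of a maximum-type statistic with simulated critical values can be sketched as below. This is a simplified illustration under stated assumptions, not the HDtest package's interface: the statistic is the largest absolute studentized mean, the null is approximated by Gaussian draws whose covariance equals the sample correlation matrix, and the function name and defaults are hypothetical. The feature screening step mentioned in the abstract is omitted.

```python
import numpy as np


def max_type_mean_test(X, n_boot=1000, seed=0):
    """One-sample test of H0: all coordinate means are zero.

    X : samples-by-features data matrix (features may outnumber samples)
    Returns the max-type statistic and a simulation-based p-value.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)

    # Maximum-type statistic: largest absolute studentized coordinate mean.
    stat = float(np.max(np.sqrt(n) * np.abs(mean) / sd))

    # Parametric bootstrap: each row of `draws` is N(0, R_hat), where R_hat is
    # the sample correlation matrix. Writing the draw as g @ Z / sqrt(n - 1)
    # with g ~ N(0, I_n) avoids forming the features-by-features matrix.
    Z = (X - mean) / sd
    draws = rng.standard_normal((n_boot, n)) @ Z / np.sqrt(n - 1)
    null_max = np.abs(draws).max(axis=1)

    pval = (1 + int(np.sum(null_max >= stat))) / (n_boot + 1)
    return stat, pval
```

Because the null distribution is simulated from the estimated correlation structure rather than derived from a limiting distribution, no sparsity or banding condition on the covariance matrix is needed, which matches the motivation given in the abstract.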
Affiliation(s)
- Jinyuan Chang
- School of Statistics, Southwestern University of Finance and Economics, Chengdu, Sichuan 611130, China
- Chao Zheng
- School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Australia
- Wen-Xin Zhou
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, U.S.A
- Wen Zhou
- Department of Statistics, Colorado State University, Fort Collins, Colorado 80523, U.S.A
13
Broniscer A, Hwang SN, Chamdine O, Lin T, Pounds S, Onar-Thomas A, Chi L, Shurtleff S, Allen S, Gajjar A, Northcott P, Orr BA. Bithalamic gliomas may be molecularly distinct from their unilateral high-grade counterparts. Brain Pathol 2017; 28:112-120. [PMID: 28032389 DOI: 10.1111/bpa.12484] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 09/22/2016] [Accepted: 12/20/2016] [Indexed: 12/18/2022]
Abstract
Bithalamic gliomas are rare cancers diagnosed based on poorly defined radiologic criteria. Infiltrative astrocytomas account for most cases. While some previous studies reported dismal outcomes for patients with bithalamic gliomas irrespective of therapy and histologic grade, others described better prognoses even without anticancer therapy. Little is known about their molecular characteristics. We reviewed clinical, radiologic, and histologic features of patients with bithalamic gliomas treated at our institution over 15 years. Targeted sequencing of mutational hotspots in H3F3A, HIST1H3B, IDH1/2, and BRAF, and genome-wide analysis of DNA methylation and copy number abnormalities were performed in available tumors. Eleven patients with bithalamic gliomas were identified. Their median age at diagnosis was 4.8 years (range: 1-15.7). Additional involvement of the brainstem, basal ganglia, and cerebral lobes occurred in 11, 9, and 3 cases, respectively. All patients presented with hydrocephalus. Two-thirds of the patients had a histologic diagnosis of anaplastic astrocytoma. Despite aggressive therapy, our youngest patient, the only one diagnosed before 1 year of age, is the sole long-term survivor. DNA methylation could be performed in seven tumors, all of which clustered with the RTK I 'PDGFRA' subgroup by unsupervised hierarchical analysis of methylation array against a previously published cohort of 59 pediatric high-grade gliomas. Sequencing of hotspot mutations could be done in 10 tumors, none of which harbored H3F3A p.K27 and/or the respective DNA methylation signature, or any other hotspot mutations. Amplification of MDM4 (n = 2), PDGFRA (n = 2), and ID2 combined with MYCN (n = 1) was observed in 7 tumors available for analysis. In comparison with the previously published experience with unilateral high-grade thalamic astrocytomas where H3F3A p.K27 was present in two-thirds of cases, the absence of this molecular subgroup in bithalamic gliomas was striking. This finding suggests that unilateral and bithalamic high-grade gliomas may represent two distinct molecular entities.
Affiliation(s)
- Alberto Broniscer
- Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN; Department of Pediatrics, University of Tennessee Health Science Center, Memphis, TN
- Scott N Hwang
- Department of Diagnostic Imaging, St. Jude Children's Research Hospital, Memphis, TN
- Omar Chamdine
- Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN
- Tong Lin
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN
- Stanley Pounds
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN
- Arzu Onar-Thomas
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN
- Lei Chi
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN
- Sheila Shurtleff
- Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN
- Sariah Allen
- Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN
- Amar Gajjar
- Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN; Department of Pediatrics, University of Tennessee Health Science Center, Memphis, TN
- Paul Northcott
- Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, TN
- Brent A Orr
- Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN
14
Feng L, Zou C, Wang Z. Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1035380] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 10/23/2022]
15
Sokolov A, Carlin DE, Paull EO, Baertsch R, Stuart JM. Pathway-Based Genomics Prediction using Generalized Elastic Net. PLoS Comput Biol 2016; 12:e1004790. [PMID: 26960204 PMCID: PMC4784899 DOI: 10.1371/journal.pcbi.1004790] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 05/11/2015] [Accepted: 02/04/2016] [Indexed: 11/19/2022]
Abstract
We present a novel regularization scheme called The Generalized Elastic Net (GELnet) that incorporates gene pathway information into feature selection. The proposed formulation is applicable to a wide variety of problems in which the interpretation of predictive features using known molecular interactions is desired. The method naturally steers solutions toward sets of mechanistically interlinked genes. Using experiments on synthetic data, we demonstrate that pathway-guided results maintain, and often improve, the accuracy of predictors even in cases where the full gene network is unknown. We apply the method to predict the drug response of breast cancer cell lines. GELnet is able to reveal genetic determinants of sensitivity and resistance for several compounds. In particular, for an EGFR/HER2 inhibitor, it finds a possible trans-differentiation resistance mechanism missed by the corresponding pathway agnostic approach.
Affiliation(s)
- Artem Sokolov
- Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Daniel E. Carlin
- Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Evan O. Paull
- Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Robert Baertsch
- Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Joshua M. Stuart
- Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
16
17
Kim S, Ahn JY, Lee W. On high-dimensional two sample mean testing statistics: a comparative study with a data adaptive choice of coefficient vector. Comput Stat 2015. [DOI: 10.1007/s00180-015-0605-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
18
Benidt S, Nettleton D. SimSeq: a nonparametric approach to simulation of RNA-sequence datasets. Bioinformatics 2015; 31:2131-40. [PMID: 25725090 DOI: 10.1093/bioinformatics/btv124] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Received: 08/21/2014] [Accepted: 02/23/2015] [Indexed: 02/02/2023]
Abstract
MOTIVATION RNA sequencing analysis methods are often derived by relying on hypothetical parametric models for read counts that are not likely to be precisely satisfied in practice. Methods are often tested by analyzing data that have been simulated according to the assumed model. This testing strategy can result in an overly optimistic view of the performance of an RNA-seq analysis method. RESULTS We develop a data-based simulation algorithm for RNA-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source RNA-seq dataset provided by the user. We conduct simulation experiments based on the negative binomial distribution and our proposed nonparametric simulation algorithm. We compare performance between the two simulation experiments over a small subset of statistical methods for RNA-seq analysis available in the literature. We use as a benchmark the ability of a method to control the false discovery rate. Not surprisingly, methods based on parametric modeling assumptions seem to perform better with respect to false discovery rate control when data are simulated from parametric models rather than using our more realistic nonparametric simulation strategy. AVAILABILITY AND IMPLEMENTATION The nonparametric simulation algorithm developed in this article is implemented in the R package SimSeq, which is freely available under the GNU General Public License (version 2 or later) from the Comprehensive R Archive Network (http://cran.rproject.org/). CONTACT sgbenidt@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sam Benidt
- Department of Statistics, Iowa State University, Ames, IA 50011-1210, USA
- Dan Nettleton
- Department of Statistics, Iowa State University, Ames, IA 50011-1210, USA
19
Li L, Hur M, Lee JY, Zhou W, Song Z, Ransom N, Demirkale CY, Nettleton D, Westgate M, Arendsee Z, Iyer V, Shanks J, Nikolau B, Wurtele ES. A systems biology approach toward understanding seed composition in soybean. BMC Genomics 2015; 16 Suppl 3:S9. [PMID: 25708381 PMCID: PMC4331812 DOI: 10.1186/1471-2164-16-s3-s9] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Indexed: 12/21/2022]
Abstract
BACKGROUND The molecular, biochemical, and genetic mechanisms that regulate the complex metabolic network of soybean seed development determine the ultimate balance of protein, lipid, and carbohydrate stored in the mature seed. Many of the genes and metabolites that participate in seed metabolism are unknown or poorly defined; even more remains to be understood about the regulation of their metabolic networks. A global omics analysis can provide insights into the regulation of seed metabolism, even without a priori assumptions about the structure of these networks. RESULTS With the future goal of predictive biology in mind, we have combined metabolomics, transcriptomics, and metabolic flux technologies to reveal the global developmental and metabolic networks that determine the structure and composition of the mature soybean seed. We have coupled this global approach with interactive bioinformatics and statistical analyses to gain insights into the biochemical programs that determine soybean seed composition. For this purpose, we used the Plant/Eukaryotic and Microbial Metabolomics Systems Resource (PMR, http://www.metnetdb.org/pmr), a platform that incorporates metabolomics data to develop hypotheses concerning the organization and regulation of metabolic networks, and the MetNet systems biology tools (http://www.metnetdb.org) for plant omics data, a framework to enable interactive visualization of metabolic and regulatory networks. CONCLUSIONS This combination of high-throughput experimental data and bioinformatics analyses has revealed sets of specific genes, genetic perturbations and mechanisms, and metabolic changes that are associated with the developmental variation in soybean seed composition. Researchers can explore these metabolomics and transcriptomics data interactively at PMR.
Affiliation(s)
- Ling Li
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Metabolic Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Biorenewable Chemicals, Iowa State University, Ames, Iowa 50011, USA
| | - Manhoi Hur
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Metabolic Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Biorenewable Chemicals, Iowa State University, Ames, Iowa 50011, USA
| | - Joon-Yong Lee
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
| | - Wenxu Zhou
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, USA
| | - Zhihong Song
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, USA
| | - Nick Ransom
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
| | | | - Dan Nettleton
- Department of Statistics, Iowa State University, Ames, Iowa 50011, USA
| | - Mark Westgate
- Department of Agronomy, Iowa State University, Ames, Iowa 50011, USA
| | - Zebulun Arendsee
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
| | - Vidya Iyer
- Department of Chemical and Biological Engineering, Iowa State University, Ames, Iowa 50011, USA
| | - Jackie Shanks
- Department of Chemical and Biological Engineering, Iowa State University, Ames, Iowa 50011, USA
- Center for Biorenewable Chemicals, Iowa State University, Ames, Iowa 50011, USA
| | - Basil Nikolau
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Metabolic Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Biorenewable Chemicals, Iowa State University, Ames, Iowa 50011, USA
| | - Eve Syrkin Wurtele
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Metabolic Biology, Iowa State University, Ames, Iowa 50011, USA
- Center for Biorenewable Chemicals, Iowa State University, Ames, Iowa 50011, USA
|
20
|
Saunders G, Stevens JR, Isom SC. A shortcut for multiple testing on the directed acyclic graph of gene ontology. BMC Bioinformatics 2014; 15:349. [PMID: 25366961 PMCID: PMC4232707 DOI: 10.1186/s12859-014-0349-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 05/16/2014] [Accepted: 10/09/2014] [Indexed: 01/19/2023]
Abstract
Background Gene set testing has become an important analysis technique in high throughput microarray and next generation sequencing studies for uncovering patterns of differential expression of various biological processes. Often, the large number of gene sets tested simultaneously requires a multiplicity correction. This work provides a substantial computational improvement to an existing familywise error rate controlling multiplicity approach (the Focus Level method) for gene set testing in high throughput microarray and next generation sequencing studies using Gene Ontology graphs, which we call the Short Focus Level. Results The Short Focus Level procedure, which performs a shortcut of the full Focus Level procedure, is achieved by extending the reach of graphical weighted Bonferroni testing to closed testing situations where restricted hypotheses are present, such as in the Gene Ontology graphs. The Short Focus Level multiplicity adjustment can perform the full top-down approach of the original Focus Level procedure, overcoming a significant disadvantage of the otherwise powerful Focus Level multiplicity adjustment. The computational and power differences of the Short Focus Level procedure as compared to the original Focus Level procedure are demonstrated both through simulation and using real data. Conclusions The Short Focus Level procedure shows a significant increase in computation speed over the original Focus Level procedure (as much as ∼15,000 times faster). The Short Focus Level should be used in place of the Focus Level procedure whenever the logical assumptions of the Gene Ontology graph structure are appropriate for the study objectives and when either no a priori focus level of interest can be specified or the focus level is selected at a higher level of the graph, where the Focus Level procedure is computationally intractable.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0349-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Garrett Saunders
- Utah State University, Department of Mathematics & Statistics, Logan, Utah, USA; Brigham Young University-Idaho, Department of Mathematics, Rexburg, Idaho, USA.
| | - John R Stevens
- Utah State University, Department of Mathematics & Statistics, Logan, Utah, USA.
| | - S Clay Isom
- Utah State University, Department of Animal, Dairy, and Veterinary Sciences, Logan, Utah, USA.
|
21
|
Emmert-Streib F, Tripathi S, Simoes RDM, Hawwa AF, Dehmer M. The human disease network. Systems Biomedicine 2014. [DOI: 10.4161/sysb.22816] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 11/19/2022]
|
22
|
|
23
|
Clark NR, Hu KS, Feldmann AS, Kou Y, Chen EY, Duan Q, Ma'ayan A. The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC Bioinformatics 2014; 15:79. [PMID: 24650281 PMCID: PMC4000056 DOI: 10.1186/1471-2105-15-79] [Citation(s) in RCA: 129] [Impact Index Per Article: 12.9] [Received: 11/21/2013] [Accepted: 03/11/2014] [Indexed: 11/13/2022]
Abstract
Background Identifying differentially expressed genes (DEG) is a fundamental step in studies that perform genome wide expression profiling. Typically, DEG are identified by univariate approaches such as Significance Analysis of Microarrays (SAM) or Linear Models for Microarray Data (LIMMA) for processing cDNA microarrays, and differential gene expression analysis based on the negative binomial distribution (DESeq) or Empirical analysis of Digital Gene Expression data in R (edgeR) for RNA-seq profiling. Results Here we present a new geometrical multivariate approach to identify DEG called the Characteristic Direction. We demonstrate that the Characteristic Direction method is significantly more sensitive than existing methods for identifying DEG in the context of transcription factor (TF) and drug perturbation responses over a large number of microarray experiments. We also benchmarked the Characteristic Direction method using synthetic data, as well as RNA-Seq data. A large collection of microarray expression data from TF perturbations (73 experiments) and drug perturbations (130 experiments) extracted from the Gene Expression Omnibus (GEO), as well as an RNA-Seq study that profiled genome-wide gene expression and STAT3 DNA binding in two subtypes of diffuse large B-cell Lymphoma, were used for benchmarking the method using real data. ChIP-Seq data identifying DNA binding sites of the perturbed TFs, as well as known drug targets of the perturbing drugs, were used as prior knowledge silver-standard for validation. In all cases the Characteristic Direction DEG calling method outperformed other methods. We find that when drugs are applied to cells in various contexts, the proteins that interact with the drug-targets are differentially expressed and more of the corresponding genes are discovered by the Characteristic Direction method. 
In addition, we show that the Characteristic Direction conceptualization can be used to perform improved gene set enrichment analyses when compared with the gene-set enrichment analysis (GSEA) and the hypergeometric test. Conclusions The application of the Characteristic Direction method may shed new light on relevant biological mechanisms that would have remained undiscovered by the current state-of-the-art DEG methods. The method is freely accessible via various open source code implementations using four popular programming languages: R, Python, MATLAB and Mathematica, all available at: http://www.maayanlab.net/CD.
Affiliation(s)
| | | | | | | | | | | | - Avi Ma'ayan
- Department of Pharmacology and Systems Therapeutics, Systems Biology Center New York (SBCNY), Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
|
24
|
Soheila K, Hamid A, Farid Z, Mostafa RT, Nasrin DN, Syyed-Mohammad T, Vahide T. Comparison of univariate and multivariate gene set analysis in acute lymphoblastic leukemia. Asian Pac J Cancer Prev 2013; 14:1629-33. [PMID: 23679247 DOI: 10.7314/apjcp.2013.14.3.1629] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND Gene set analysis (GSA) combines biological and statistical knowledge to identify gene sets that are differentially expressed between two or more phenotypes. MATERIALS AND METHODS In this paper, gene sets differentially expressed between acute lymphoblastic leukaemia (ALL) with BCR-ABL and ALL with no observed cytogenetic abnormalities were determined by GSA methods. BCR-ABL is an abnormal gene found in some people with ALL. RESULTS The results of two GSAs showed that the Category test identified 30 gene sets differentially expressed between the two phenotypes, while Hotelling's T2 discovered just 19 gene sets. On the other hand, assessment of common genes among significant gene sets showed that there was high agreement between the results of GSA and the findings of biologists. In addition, the performance of these methods was compared on simulated and ALL data. CONCLUSIONS The results on simulated data indicated that, as the correlation between gene pairs increased, the multivariate (Hotelling's T2) test showed a decrease in the type I error rate and an increase in power, in contrast to the univariate (Category) test.
Affiliation(s)
- Khodakarim Soheila
- Department of Epidemiology, Faculty of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran
|
25
|
Assessment method for a power analysis to identify differentially expressed pathways. PLoS One 2012; 7:e37510. [PMID: 22629411 PMCID: PMC3356338 DOI: 10.1371/journal.pone.0037510] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 11/26/2011] [Accepted: 04/22/2012] [Indexed: 12/04/2022]
Abstract
Gene expression data can provide a very rich source of information for elucidating biological function at the pathway level if the experimental design considers the needs of the statistical analysis methods. The purpose of this paper is to provide a comparative analysis of statistical methods for detecting the differential expression of pathways (DEP). In contrast to many other studies conducted so far, we use three novel simulation types, producing a more realistic correlation structure than previous simulation methods. This also includes the generation of surrogate data from two large-scale microarray experiments from prostate cancer and ALL. As a result of our comprehensive analysis of parameter configurations, we find that each method should only be applied if certain conditions of the data from a pathway are met. Further, we provide method-specific estimates for the optimal sample size for microarray experiments aiming to identify DEP in order to avoid an underpowered design. Our study highlights the sensitivity of the studied methods to the parameters of the system.
|
26
|
|
27
|
Van Hemert JL, Dickerson JA. Discriminating response groups in metabolic and regulatory pathway networks. Bioinformatics 2012; 28:947-54. [DOI: 10.1093/bioinformatics/bts039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
|
28
|
Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 2012; 99:248-56. [PMID: 22268221 DOI: 10.3732/ajb.1100340] [Citation(s) in RCA: 156] [Impact Index Per Article: 13.0] [Indexed: 05/18/2023]
Abstract
RNA-Seq technologies are quickly revolutionizing genomic studies, and statistical methods for RNA-seq data are under continuous development. Timely review and comparison of the most recently proposed statistical methods will provide a useful guide for choosing among them for data analysis. Particular interest surrounds the ability to detect differential expression (DE) in genes. Here we compare four recently proposed statistical methods, edgeR, DESeq, baySeq, and a method with a two-stage Poisson model (TSPM), through a variety of simulations that were based on different distribution models or real data. We compared the ability of these methods to detect DE genes in terms of the significance ranking of genes and false discovery rate control. All methods compared are implemented in freely available software. We also discuss the availability and functions of the currently available versions of these software.
Affiliation(s)
- Vanessa M Kvam
- Department of Statistics, Iowa State University, Snedecor Hall, Ames, Iowa 50011-1210, USA
|
29
|
Liang K, Nettleton D. A Hidden Markov Model Approach to Testing Multiple Hypotheses on a Tree-Transformed Gene Ontology Graph. J Am Stat Assoc 2012. [DOI: 10.1198/jasa.2010.tm10195] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/21/2022]
Affiliation(s)
- Kun Liang
- Kun Liang is Doctoral Candidate, Department of Statistics, Iowa State University, Ames, IA 50011. Dan Nettleton is Professor, Department of Statistics, Iowa State University, Ames, IA 50011. This work was supported by the National Science Foundation grant No. 0714978 and CCF-0811804, and National Research Initiative of the USDA-CSREES grant No. 2008-35600-18786. The authors wish to thank the editor, the associate editor, and two reviewers for their comments and suggestions which led to an improved
| | - Dan Nettleton
- Professor, Department of Statistics, Iowa State University, Ames, IA 50011
|
30
|
Podpečan V, Lavrač N, Mozetič I, Novak PK, Trajkovski I, Langohr L, Kulovesi K, Toivonen H, Petek M, Motaln H, Gruden K. SegMine workflows for semantic microarray data analysis in Orange4WS. BMC Bioinformatics 2011; 12:416. [PMID: 22029475 PMCID: PMC3216973 DOI: 10.1186/1471-2105-12-416] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 05/31/2011] [Accepted: 10/26/2011] [Indexed: 12/03/2022]
Abstract
Background In experimental data analysis, bioinformatics researchers increasingly rely on tools that enable the composition and reuse of scientific workflows. The utility of current bioinformatics workflow environments can be significantly increased by offering advanced data mining services as workflow components. Such services can support, for instance, knowledge discovery from diverse distributed data and knowledge sources (such as GO, KEGG, PubMed, and experimental databases). Specifically, cutting-edge data analysis approaches, such as semantic data mining, link discovery, and visualization, have not yet been made available to researchers investigating complex biological datasets. Results We present a new methodology, SegMine, for semantic analysis of microarray data by exploiting general biological knowledge, and a new workflow environment, Orange4WS, with integrated support for web services in which the SegMine methodology is implemented. The SegMine methodology consists of two main steps. First, the semantic subgroup discovery algorithm is used to construct elaborate rules that identify enriched gene sets. Then, a link discovery service is used for the creation and visualization of new biological hypotheses. The utility of SegMine, implemented as a set of workflows in Orange4WS, is demonstrated in two microarray data analysis applications. In the analysis of senescence in human stem cells, the use of SegMine resulted in three novel research hypotheses that could improve understanding of the underlying mechanisms of senescence and identification of candidate marker genes. Conclusions Compared to the available data analysis systems, SegMine offers improved hypothesis generation and data interpretation for bioinformatics in an easy-to-use integrated workflow environment.
|
31
|
Basu S, Pan W, Shen X, Oetting WS. Multilocus association testing with penalized regression. Genet Epidemiol 2011; 35:755-65. [PMID: 21922539 DOI: 10.1002/gepi.20625] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 04/20/2011] [Revised: 06/09/2011] [Accepted: 07/04/2011] [Indexed: 12/26/2022]
Abstract
In multilocus association analysis, since some markers may not be associated with a trait, it seems attractive to use penalized regression with the capability of automatic variable selection. On the other hand, in spite of a rapidly growing body of literature on penalized regression, most focus on variable selection and outcome prediction, for which penalized methods are generally more effective than their nonpenalized counterparts. However, for statistical inference, i.e. hypothesis testing and interval estimation, it is less clear how penalized methods would perform, or even how to best apply them, largely due to lack of studies on this topic. In our motivating data for a cohort of kidney transplant recipients, it is of primary interest to assess whether a group of genetic variants are associated with a binary clinical outcome, acute rejection at 6 months. In this article, we study some technical issues and alternative implementations of hypothesis testing in Lasso penalized logistic regression, and compare their performance with each other and with several existing global tests, some of which are specifically designed as variance component tests for high-dimensional data. The most interesting, and perhaps surprising, conclusion of this study is that, for low to moderately high-dimensional data, statistical tests based on Lasso penalized regression are not necessarily more powerful than some existing global tests. In addition, in penalized regression, rather than building a test based on a single selected "best" model, combining multiple tests, each of which is built on a candidate model, might be more promising.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
|
32
|
Sohn I, Owzar K, Lim J, George SL, Mackey Cushman S, Jung SH. Multiple testing for gene sets from microarray experiments. BMC Bioinformatics 2011; 12:209. [PMID: 21615889 PMCID: PMC3131260 DOI: 10.1186/1471-2105-12-209] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/13/2010] [Accepted: 05/26/2011] [Indexed: 11/25/2022]
Abstract
Background A key objective in many microarray association studies is the identification of individual genes associated with clinical outcome. It is often of additional interest to identify sets of genes, known a priori to have similar biologic function, associated with the outcome. Results In this paper, we propose a general permutation-based framework for gene set testing that controls the false discovery rate (FDR) while accounting for the dependency among the genes within and across each gene set. The application of the proposed method is demonstrated using three public microarray data sets. The performance of our proposed method is contrasted to two other existing Gene Set Enrichment Analysis (GSEA) and Gene Set Analysis (GSA) methods. Conclusions Our simulations show that the proposed method controls the FDR at the desired level. Through simulations and case studies, we observe that our method performs better than GSEA and GSA, especially when the number of prognostic gene sets is large.
Affiliation(s)
- Insuk Sohn
- Biostatistics and Bioinformatics Center, Samsung Cancer Research Institute, Samsung Medical Center, Seoul 137-710, Republic of Korea
|
33
|
Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS Comput Biol 2011; 7:e1002053. [PMID: 21637797 PMCID: PMC3102754 DOI: 10.1371/journal.pcbi.1002053] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.0] [Indexed: 12/17/2022]
|
34
|
McHale CM, Zhang L, Lan Q, Vermeulen R, Li G, Hubbard AE, Porter KE, Thomas R, Portier CJ, Shen M, Rappaport SM, Yin S, Smith MT, Rothman N. Global gene expression profiling of a population exposed to a range of benzene levels. Environ Health Perspect 2011; 119:628-34. [PMID: 21147609 PMCID: PMC3094412 DOI: 10.1289/ehp.1002546] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.5] [Received: 06/10/2010] [Accepted: 12/13/2010] [Indexed: 05/17/2023]
Abstract
BACKGROUND Benzene, an established cause of acute myeloid leukemia (AML), may also cause one or more lymphoid malignancies in humans. Previously, we identified genes and pathways associated with exposure to high (> 10 ppm) levels of benzene through transcriptomic analyses of blood cells from a small number of occupationally exposed workers. OBJECTIVES The goals of this study were to identify potential biomarkers of benzene exposure and/or early effects and to elucidate mechanisms relevant to risk of hematotoxicity, leukemia, and lymphoid malignancy in occupationally exposed individuals, many of whom were exposed to benzene levels < 1 ppm, the current U.S. occupational standard. METHODS We analyzed global gene expression in the peripheral blood mononuclear cells of 125 workers exposed to benzene levels ranging from < 1 ppm to > 10 ppm. Study design and analysis with a mixed-effects model minimized potential confounding and experimental variability. RESULTS We observed highly significant widespread perturbation of gene expression at all exposure levels. The AML pathway was among the pathways most significantly associated with benzene exposure. Immune response pathways were associated with most exposure levels, potentially providing biological plausibility for an association between lymphoma and benzene exposure. We identified a 16-gene expression signature associated with all levels of benzene exposure. CONCLUSIONS Our findings suggest that chronic benzene exposure, even at levels below the current U.S. occupational standard, perturbs many genes, biological processes, and pathways. These findings expand our understanding of the mechanisms by which benzene may induce hematotoxicity, leukemia, and lymphoma and reveal relevant potential biomarkers associated with a range of exposures.
Affiliation(s)
- Cliona M McHale
- School of Public Health, University of California-Berkeley, Berkeley, California 94720, USA.
|
35
|
Schwender H, Ruczinski I, Ickstadt K. Testing SNPs and sets of SNPs for importance in association studies. Biostatistics 2010; 12:18-32. [PMID: 20601626 DOI: 10.1093/biostatistics/kxq042] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022]
Abstract
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Affiliation(s)
- Holger Schwender
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA.
|
36
|
Tintle NL, Borchers B, Brown M, Bekmetjev A. Comparing gene set analysis methods on single-nucleotide polymorphism data from Genetic Analysis Workshop 16. BMC Proc 2009; 3 Suppl 7:S96. [PMID: 20018093 PMCID: PMC2796000 DOI: 10.1186/1753-6561-3-s7-s96] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
Recently, gene set analysis (GSA) has been extended from use on gene expression data to use on single-nucleotide polymorphism (SNP) data in genome-wide association studies. When GSA has been demonstrated on SNP data, two popular statistics from gene expression data analysis (gene set enrichment analysis [GSEA] and Fisher's exact test [FET]) have been used. However, GSEA and FET have shown a lack of power and robustness in the analysis of gene expression data. The purpose of this work is to investigate whether the same issues are also true for the analysis of SNP data. Ultimately, we conclude that GSEA and FET are not optimal for the analysis of SNP data when compared with the SUMSTAT method. In analysis of real SNP data from the Framingham Heart Study, we find that SUMSTAT finds many more gene sets to be significant when compared with other methods. In an analysis of simulated data, SUMSTAT demonstrates high power and better control of the type I error rate. GSA is a promising approach to the analysis of SNP data in GWAS and use of the SUMSTAT statistic instead of GSEA or FET may increase power and robustness.
Affiliation(s)
- Nathan L Tintle
- Department of Mathematics, Hope College, 27 Graves Place, Holland, Michigan 49423, USA.
|
37
|
Glazko GV, Emmert-Streib F. Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics 2009; 25:2348-54. [PMID: 19574285 DOI: 10.1093/bioinformatics/btp406] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION Recently, many univariate and several multivariate approaches have been suggested for testing differential expression of gene sets between different phenotypes. However, despite a wealth of literature studying their performance on simulated and real biological data, still there is a need to quantify their relative performance when they are testing different null hypotheses. RESULTS In this article, we compare the performance of univariate and multivariate tests on both simulated and biological data. In the simulation study we demonstrate that high correlations equally affect the power of both, univariate as well as multivariate tests. In addition, for most of them the power is similarly affected by the dimensionality of the gene set and by the percentage of genes in the set, for which expression is changing between two phenotypes. The application of different test statistics to biological data reveals that three statistics (sum of squared t-tests, Hotelling's T(2), N-statistic), testing different null hypotheses, find some common but also some complementing differentially expressed gene sets under specific settings. This demonstrates that due to complementing null hypotheses each test projects on different aspects of the data and for the analysis of biological data it is beneficial to use all three tests simultaneously instead of focusing exclusively on just one.
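Two of the statistics named in this abstract, the sum of squared t-tests and Hotelling's T², can be compared on a toy gene set. The sketch below is an illustration under assumptions not stated in the abstract: pooled-variance two-sample t-statistics, a pooled covariance matrix, more samples than genes so that the matrix can be inverted, and hypothetical function names and data. It is not the authors' code, and the N-statistic is omitted.

```python
def mean(values):
    return sum(values) / len(values)


def mean_difference(group1, group2):
    """Per-gene difference of group means; rows are samples, columns are genes."""
    p = len(group1[0])
    return [mean([r[j] for r in group1]) - mean([r[j] for r in group2])
            for j in range(p)]


def pooled_cov(group1, group2):
    """Pooled sample covariance matrix of the two groups."""
    p = len(group1[0])
    cov = [[0.0] * p for _ in range(p)]
    for group in (group1, group2):
        centre = [mean([row[j] for row in group]) for j in range(p)]
        for row in group:
            for j in range(p):
                for k in range(p):
                    cov[j][k] += (row[j] - centre[j]) * (row[k] - centre[k])
    dof = len(group1) + len(group2) - 2
    return [[v / dof for v in row] for row in cov]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hotelling_t2(group1, group2):
    """Two-sample Hotelling's T^2: uses the full pooled covariance matrix."""
    n1, n2 = len(group1), len(group2)
    d = mean_difference(group1, group2)
    w = solve(pooled_cov(group1, group2), d)
    return n1 * n2 / (n1 + n2) * sum(di * wi for di, wi in zip(d, w))


def sum_squared_t(group1, group2):
    """Sum over genes of squared pooled-variance t-statistics (ignores covariances)."""
    n1, n2 = len(group1), len(group2)
    d = mean_difference(group1, group2)
    s = pooled_cov(group1, group2)
    scale = 1 / n1 + 1 / n2
    return sum(d[j] ** 2 / (s[j][j] * scale) for j in range(len(d)))


# Hypothetical expression values: 4 samples per phenotype, 2 genes in the set.
group1 = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [2.5, 2.0]]
group2 = [[3.0, 2.5], [4.0, 4.5], [5.0, 3.0], [4.5, 5.0]]
print(hotelling_t2(group1, group2), sum_squared_t(group1, group2))
```

For a gene set containing a single gene the two statistics coincide; they differ once the genes are correlated, because only Hotelling's T² uses the off-diagonal covariances.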
Affiliation(s)
- Galina V Glazko
- Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA.
|
38
|
Pounds S, Cheng C, Cao X, Crews KR, Plunkett W, Gandhi V, Rubnitz J, Ribeiro RC, Downing JR, Lamba J. PROMISE: a tool to identify genomic features with a specific biologically interesting pattern of associations with multiple endpoint variables. Bioinformatics 2009; 25:2013-9. [PMID: 19528086 DOI: 10.1093/bioinformatics/btp357] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
MOTIVATION In some applications, prior biological knowledge can be used to define a specific pattern of association of multiple endpoint variables with a genomic variable that is biologically most interesting. However, to our knowledge, there is no statistical procedure designed to detect specific patterns of association with multiple endpoint variables. RESULTS Projection onto the most interesting statistical evidence (PROMISE) is proposed as a general procedure to identify genomic variables that exhibit a specific biologically interesting pattern of association with multiple endpoint variables. Biological knowledge of the endpoint variables is used to define a vector that represents the biologically most interesting values for statistics that characterize the associations of the endpoint variables with a genomic variable. A test statistic is defined as the dot-product of the vector of the observed association statistics and the vector of the most interesting values of the association statistics. By definition, this test statistic is proportional to the length of the projection of the observed vector of correlations onto the vector of most interesting associations. Statistical significance is determined via permutation. In simulation studies and an example application, PROMISE shows greater statistical power to identify genes with the interesting pattern of associations than classical multivariate procedures, individual endpoint analyses or listing genes that have the pattern of interest and are significant in more than one individual endpoint analysis. AVAILABILITY Documented R routines are freely available from www.stjuderesearch.org/depts/biostats and will soon be available as a Bioconductor package from www.bioconductor.org.
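The test statistic described in this abstract, a dot product between the observed association statistics and the vector of most interesting values with significance assessed by permutation, can be sketched briefly. This is a minimal illustration and not the authors' R routines: it assumes Pearson correlation as the association statistic, a one-sided permutation test that shuffles the genomic variable against the endpoints, and hypothetical function names and toy data.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def promise_statistic(genomic, endpoints, pattern):
    """Dot product of the observed correlations with the pattern of interest."""
    return sum(w * pearson(genomic, e) for w, e in zip(pattern, endpoints))


def permutation_p_value(genomic, endpoints, pattern, n_perm=2000, seed=0):
    """One-sided permutation p-value (assumed scheme: shuffle the genomic variable)."""
    rng = random.Random(seed)
    observed = promise_statistic(genomic, endpoints, pattern)
    shuffled = list(genomic)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if promise_statistic(shuffled, endpoints, pattern) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: one genomic variable and two endpoint variables, 8 subjects.
genomic = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
endpoints = [
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1],  # expected to rise with the feature
    [8.0, 7.1, 5.9, 5.2, 3.8, 3.1, 2.2, 0.9],  # expected to fall with the feature
]
pattern = [1.0, -1.0]  # the "most interesting" signs for the two associations

statistic, p_value = permutation_p_value(genomic, endpoints, pattern)
print(statistic, p_value)
```

Shuffling the genomic variable while leaving the endpoint rows intact preserves the correlation among endpoints under the null, which is the reason this permutation scheme was chosen for the sketch.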
Affiliation(s)
- Stan Pounds
- Department of Biostatistics, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105, USA.
|
39
|
Thomas R, Gohlke JM, Stopper GF, Parham FM, Portier CJ. Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol 2009; 10:R44. [PMID: 19393085 PMCID: PMC2688935 DOI: 10.1186/gb-2009-10-4-r44] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 11/21/2008] [Revised: 03/19/2009] [Accepted: 04/24/2009] [Indexed: 01/01/2023] Open
Abstract
A method is proposed that finds enriched pathways relevant to a studied condition using the measured molecular data and also the structural information of the pathway viewed as a network of nodes and edges. Tests are performed using simulated data and genomic data sets and the method is compared to two existing approaches. The analysis provided demonstrates the method proposed is very competitive with the current approaches and also provides biologically relevant results.
Affiliation(s)
- Reuben Thomas
- Environmental Systems Biology Group, Laboratory of Molecular Toxicology, National Institute of Environmental Health Sciences, RTP, NC 27709, USA
|
40
|
Wu Z, Zhao X, Chen L. Identifying responsive functional modules from protein-protein interaction network. Mol Cells 2009; 27:271-7. [PMID: 19326072 DOI: 10.1007/s10059-009-0035-x] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Received: 01/23/2009] [Accepted: 01/26/2009] [Indexed: 10/21/2022] Open
Abstract
Proteins interact with each other within a cell, and those interactions give rise to the biological function and dynamical behavior of cellular systems. Generally, protein interactions are temporally, spatially, or condition dependent in a specific cell, and only a small fraction of interactions take place under any given condition. Although a large amount of protein interaction data has recently been collected by high-throughput technologies, the interactions are recorded or summarized across varying conditions and therefore cannot be used directly to identify signaling pathways or active networks, which are believed to operate in specific cells under specific conditions. However, protein interactions activated under specific conditions may give hints about the biological processes underlying the corresponding phenotypes. In particular, responsive functional modules, which consist of protein interactions activated under specific conditions, can provide insight into the mechanisms underlying biological systems; e.g., protein interaction subnetworks found in certain diseases but not under normal conditions may help to discover potential biomarkers. From a computational viewpoint, identifying responsive functional modules can be formulated as an optimization problem. Because this combinatorial problem is NP-hard, efficient computational methods for extracting responsive functional modules are in strong demand. In this review, we first report recent advances in the development of computational methods for extracting responsive functional modules or active pathways from protein interaction networks and microarray data. Then, from a computational perspective, we discuss remaining obstacles and perspectives for this attractive and challenging topic in systems biology.
Affiliation(s)
- Zikai Wu
- Institute of Systems Biology, Shanghai University, Shanghai 200444, China
|
41
|
Ma S, Kosorok MR. Identification of differential gene pathways with principal component analysis. Bioinformatics 2009; 25:882-9. [PMID: 19223452 DOI: 10.1093/bioinformatics/btp085] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Indexed: 11/15/2022]
Abstract
MOTIVATION: Development of high-throughput technology makes it possible to measure expressions of thousands of genes simultaneously. Genes have the inherent pathway structure, where pathways are composed of multiple genes with coordinated biological functions. It is of great interest to identify differential gene pathways that are associated with the variations of phenotypes. RESULTS: We propose the following approach for detecting differential gene pathways. First, we construct gene pathways using databases such as KEGG or GO. Second, for each pathway, we extract a small number of representative features, which are linear combinations of gene expressions and/or their transformations. Specifically, we propose using (i) principal components (PCs) of gene expression sets, (ii) PCs of expanded gene expression sets and (iii) expanded sets of PCs of gene expressions, as the representative features. Third, we identify differential gene pathways as those with representative features significantly associated with the variations of phenotypes, particularly disease clinical outcomes, in regression models. The false discovery rate approach is used to adjust for multiple comparisons. Analysis of three gene expression datasets suggests that (i) the proposed approach can effectively identify differential gene pathways; (ii) PCs that explain only a small amount of variations of gene expressions may bear significant associations between gene pathways and phenotypes; (iii) including second-order terms of gene expressions may lead to identification of new differential gene pathways; (iv) the proposed approach is relatively insensitive to additional noises; and (v) the proposed approach can identify gene pathways missed by alternative approaches. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
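Option (i) in this abstract, using principal components of a pathway's gene expression set as regression features, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the phenotype is assumed to be continuous, ordinary least squares stands in for whatever regression model suits the outcome, and the significance testing and false discovery rate adjustment are omitted.

```python
import numpy as np

def pathway_pc_scores(expr, n_components=2):
    """Scores of the leading principal components of one pathway.

    expr : (n_samples, n_genes) expression matrix for the genes
           assigned to a single pathway (hypothetical input).
    """
    centered = expr - expr.mean(axis=0)
    # SVD of the centered matrix: columns of u * s are the PC scores
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def regression_r_squared(features, phenotype):
    """R^2 from an ordinary least-squares fit of phenotype on features."""
    design = np.column_stack([np.ones(len(phenotype)), features])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    resid = phenotype - design @ coef
    total = phenotype - phenotype.mean()
    return 1.0 - (resid @ resid) / (total @ total)
```

In a full analysis, a p-value would be computed for each pathway from its fitted model, and the resulting p-values would then be adjusted across pathways.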
Affiliation(s)
- Shuangge Ma
- Department of Epidemiology and Public Health, Yale University, New Haven, CT 06510, USA.
|
42
|
Heller R, Manduchi E, Grant GR, Ewens WJ. A flexible two-stage procedure for identifying gene sets that are differentially expressed. Bioinformatics 2009; 25:1019-25. [PMID: 19213738 DOI: 10.1093/bioinformatics/btp076] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Microarray data analysis has expanded from testing individual genes for differential expression to testing gene sets for differential expression. The tests at the gene set level may focus on multivariate expression changes or on the differential expression of at least one gene in the gene set. These tests may be powerful at detecting subtle changes in expression, but findings at the gene set level need to be examined further to understand whether they are informative and if so how. RESULTS: We propose to first test for differential expression at the gene set level but then proceed to test for differential expression of individual genes within discovered gene sets. We introduce the overall false discovery rate (OFDR) as an appropriate error rate to control when testing multiple gene sets and genes. We illustrate the advantage of this procedure over procedures that only test gene sets or individual genes.
Affiliation(s)
- Ruth Heller
- Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104-6340, USA.
|
43
|
Tintle NL, Best AA, DeJongh M, Van Bruggen D, Heffron F, Porwollik S, Taylor RC. Gene set analyses for interpreting microarray experiments on prokaryotic organisms. BMC Bioinformatics 2008; 9:469. [PMID: 18986519 PMCID: PMC2587482 DOI: 10.1186/1471-2105-9-469] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 05/23/2008] [Accepted: 11/05/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Despite the widespread usage of DNA microarrays, questions remain about how best to interpret the wealth of gene-by-gene transcriptional levels that they measure. Recently, methods have been proposed which use biologically defined sets of genes in interpretation, instead of examining results gene-by-gene. Despite a serious limitation, a method based on Fisher's exact test remains one of the few plausible options for gene set analysis when an experiment has few replicates, as is typically the case for prokaryotes. RESULTS: We extend five methods of gene set analysis from use on experiments with multiple replicates to use on experiments with few replicates. We then use simulated and real data to compare these methods with each other and with the Fisher's exact test (FET) method. In the simulation, we find that a method named MAXMEAN-NR maintains the nominal rate of false positive findings (type I error rate) while offering good statistical power and robustness to a variety of gene set distributions for set sizes of at least 10. Other methods (ABSSUM-NR or SUM-NR) are shown to be powerful for set sizes less than 10. Analysis of three sets of experimental data shows similar results. Furthermore, the MAXMEAN-NR method is shown to be able to detect biologically relevant sets as significant when other methods (including FET) cannot. We also find that the popular GSEA-NR method performs poorly when compared to MAXMEAN-NR. CONCLUSION: MAXMEAN-NR is a method of gene set analysis for experiments with few replicates, as is common for prokaryotes. Results of simulation and real data analysis suggest that the MAXMEAN-NR method offers increased robustness and biological relevance of findings as compared to FET and other methods, while maintaining the nominal type I error rate.
Affiliation(s)
- Nathan L Tintle
- Department of Mathematics, Hope College, Holland, Michigan, USA.
|
44
|
Gene expression in BMPR2 mutation carriers with and without evidence of pulmonary arterial hypertension suggests pathways relevant to disease penetrance. BMC Med Genomics 2008; 1:45. [PMID: 18823550 PMCID: PMC2561034 DOI: 10.1186/1755-8794-1-45] [Citation(s) in RCA: 85] [Impact Index Per Article: 5.3] [Received: 03/07/2008] [Accepted: 09/29/2008] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND: While BMPR2 mutation strongly predisposes to pulmonary arterial hypertension (PAH), only 20% of mutation carriers develop clinical disease. This finding suggests that modifier genes contribute to FPAH clinical expression. Since modifiers are likely to be common alleles, this problem is not tractable by traditional genetic approaches. Furthermore, examination of gene expression is complicated by confounding effects attributable to drugs and the disease process itself. METHODS: To resolve these problems, B-cells were isolated, EBV-immortalized, and cultured from familial PAH patients with BMPR2 mutations, mutation positive but disease-free family members, and family members without mutation. This allows examination of differences in gene expression without drug or disease-related effects. These differences were assayed by Affymetrix array, with follow-up by quantitative RT-PCR and additional statistical analyses. RESULTS: By gene array, we found consistent alterations in multiple pathways with known relationship to PAH, including actin organization, immune function, calcium balance, growth, and apoptosis. Selected genes were verified by quantitative RT-PCR using a larger sample set. One of these, CYP1B1, had tenfold lower expression than control groups in female but not male PAH patients. Analysis of overrepresented gene ontology groups suggests that risk of disease correlates with alterations in pathways more strongly than with any specific gene within those pathways. CONCLUSION: Disease status in BMPR2 mutation carriers was correlated with alterations in proliferation, GTP signaling, and stress response pathway expression. The estrogen metabolizing gene CYP1B1 is a strong candidate as a modifier gene in female PAH patients.
|
45
|
Conesa A, Bro R, García-García F, Prats JM, Götz S, Kjeldahl K, Montaner D, Dopazo J. Direct functional assessment of the composite phenotype through multivariate projection strategies. Genomics 2008; 92:373-83. [PMID: 18652888 DOI: 10.1016/j.ygeno.2008.05.015] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 02/22/2008] [Revised: 05/26/2008] [Accepted: 05/28/2008] [Indexed: 01/11/2023]
Abstract
We present a novel approach for the analysis of transcriptomics data that integrates functional annotation of gene sets with expression values in a multivariate fashion, and directly assesses the relation of functional features to a multivariate space of response phenotypical variables. Multivariate projection methods are used to obtain new correlated variables for a set of genes that share a given function. These new functional variables are then related to the response variables of interest. The analysis of the principal directions of the multivariate regression allows for the identification of gene function features correlated with the phenotype. Two different transcriptomics studies are used to illustrate the statistical and interpretative aspects of the methodology. We demonstrate the superiority of the proposed method over equivalent approaches.
Affiliation(s)
- Ana Conesa
- Bioinformatics Department, Centro de Investigación Principe Felipe, Valencia, Spain
|