26
|
Landsman D, Gentleman R, Kelso J, Francis Ouellette BF. DATABASE: A new forum for biological databases and curation. Database: The Journal of Biological Databases and Curation 2009; 2009:bap002. [PMID: 20157475 PMCID: PMC2790300 DOI: 10.1093/database/bap002]
|
27
|
Carey VJ, Gentleman R. Interpreting genetics of gene expression: integrative architecture in Bioconductor. Pacific Symposium on Biocomputing 2009:380-90. [PMID: 19209716 PMCID: PMC3378382 DOI: 10.1142/9789812836939_0036]
Abstract
Several influential studies of genotypic determinants of gene expression in humans have now been published based on various populations including HapMap cohorts. The magnitude of the analytic task (transcriptome vs. SNP-genome) is a hindrance to dissemination of efficient, thorough, and auditable inference methods for this project. We describe the structure and use of Bioconductor facilities for inference in genetics of gene expression, with simultaneous application to multiple HapMap cohorts. Tools distributed for this purpose are readily adapted for the structure and analysis of privately-generated data in expression genetics.
|
28
|
Kauffmann A, Gentleman R, Huber W. arrayQualityMetrics--a bioconductor package for quality assessment of microarray data. Bioinformatics 2008; 25:415-6. [PMID: 19106121 PMCID: PMC2639074 DOI: 10.1093/bioinformatics/btn647]
Abstract
SUMMARY The assessment of data quality is a major concern in microarray analysis. arrayQualityMetrics is a Bioconductor package that provides a report with diagnostic plots for one or two colour microarray data. The quality metrics assess reproducibility, identify apparent outlier arrays and compute measures of signal-to-noise ratio. The tool handles most current microarray technologies and is amenable to use in automated analysis pipelines or for automatic report generation, as well as for use by individuals. The diagnosis of quality remains, in principle, a context-dependent judgement, but our tool provides powerful, automated, objective and comprehensive instruments on which to base a decision. AVAILABILITY arrayQualityMetrics is a free and open source package, under LGPL license, available from the Bioconductor project at www.bioconductor.org. A user's guide and examples are provided with the package. Some examples of HTML reports generated by arrayQualityMetrics can be found at http://www.microarray-quality.org
|
29
|
Sarkar D, Parkin R, Wyman S, Bendoraite A, Sather C, Delrow J, Godwin AK, Drescher C, Huber W, Gentleman R, Tewari M. Quality assessment and data analysis for microRNA expression arrays. Nucleic Acids Res 2008; 37:e17. [PMID: 19103660 PMCID: PMC2632898 DOI: 10.1093/nar/gkn932]
Abstract
MicroRNAs are small (approximately 22 nt) RNAs that regulate gene expression and play important roles in both normal and disease physiology. The use of microarrays for global characterization of microRNA expression is becoming increasingly popular and has the potential to be a widely used and valuable research tool. However, microarray profiling of microRNA expression raises a number of data analytic challenges that must be addressed in order to obtain reliable results. We introduce here a universal reference microRNA reagent set as well as a series of nonhuman spiked-in synthetic microRNA controls, and demonstrate their use for quality control and between-array normalization of microRNA expression data. We also introduce diagnostic plots designed to assess and compare various normalization methods. We anticipate that the reagents and analytic approach presented here will be useful for improving the reliability of microRNA microarray experiments.
|
30
|
|
31
|
Abstract
BACKGROUND Synthetic lethality defines a genetic interaction where the combination of mutations in two or more genes leads to cell death. The implications of synthetic lethal screens have been discussed in the context of drug development as synthetic lethal pairs could be used to selectively kill cancer cells, but leave normal cells relatively unharmed. A challenge is to assess genome-wide experimental data and integrate the results to better understand the underlying biological processes. We propose statistical and computational tools that can be used to find relationships between synthetic lethality and cellular organizational units. RESULTS In Saccharomyces cerevisiae, we identified multi-protein complexes and pairs of multi-protein complexes that share an unusually high number of synthetic genetic interactions. As previously predicted, we found that synthetic lethality can arise from subunits of an essential multi-protein complex or between pairs of multi-protein complexes. Finally, using multi-protein complexes allowed us to take into account the pleiotropic nature of the gene products. CONCLUSIONS Modeling synthetic lethality using current estimates of the yeast interactome is an efficient approach to disentangle some of the complex molecular interactions that drive a cell. Our model in conjunction with applied statistical methods and computational methods provides new tools to better characterize synthetic genetic interactions.
|
32
|
Oron AP, Jiang Z, Gentleman R. Gene set enrichment analysis using linear models and diagnostics. Bioinformatics 2008; 24:2586-91. [PMID: 18790795 DOI: 10.1093/bioinformatics/btn465]
Abstract
MOTIVATION Gene-set enrichment analysis (GSEA) can be greatly enhanced by linear model (regression) diagnostic techniques. Diagnostics can be used to identify outlying or influential samples, and also to evaluate model fit and explore model expansion. RESULTS We demonstrate this methodology on an adult acute lymphoblastic leukemia (ALL) dataset, using GSEA based on chromosome-band mapping of genes. Individual residuals, grouped or aggregated by chromosomal loci, indicate problematic samples and potential data-entry errors, and help identify hyperdiploidy as a factor playing a key role in expression for this dataset. Subsequent analysis pinpoints suspected DNA copy number abnormalities of specific samples and chromosomes (most prevalent are chromosomes X, 21 and 14), and also reveals significant expression differences between the hyperdiploid and diploid groups on other chromosomes (most prominently 19, 22, 3 and 13)--differences which are apparently not associated with copy number. AVAILABILITY Software for the statistical tools demonstrated in this article is available as Bioconductor package GSEAlm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
33
|
Bar M, Wyman SK, Fritz BR, Qi J, Garg KS, Parkin RK, Kroh EM, Bendoraite A, Mitchell PS, Nelson AM, Ruzzo WL, Ware C, Radich JP, Gentleman R, Ruohola-Baker H, Tewari M. MicroRNA discovery and profiling in human embryonic stem cells by deep sequencing of small RNA libraries. Stem Cells 2008; 26:2496-505. [PMID: 18583537 DOI: 10.1634/stemcells.2008-0356]
Abstract
We used massively parallel pyrosequencing to discover and characterize microRNAs (miRNAs) expressed in human embryonic stem cells (hESC). Sequencing of small RNA cDNA libraries derived from undifferentiated hESC and from isogenic differentiating cultures yielded a total of 425,505 high-quality sequence reads. A custom data analysis pipeline delineated expression profiles for 191 previously annotated miRNAs, 13 novel miRNAs, and 56 candidate miRNAs. Further characterization of a subset of the novel miRNAs in Dicer-knockdown hESC demonstrated Dicer-dependent expression, providing additional validation of our results. A set of 14 miRNAs (9 known and 5 novel) was noted to be expressed in undifferentiated hESC and then strongly downregulated with differentiation. Functional annotation analysis of predicted targets of these miRNAs and comparison with a null model using non-hESC-expressed miRNAs identified statistically enriched functional categories, including chromatin remodeling and lineage-specific differentiation annotations. Finally, integration of our data with genome-wide chromatin immunoprecipitation data on OCT4, SOX2, and NANOG binding sites implicates these transcription factors in the regulation of nine of the novel/candidate miRNAs identified here. Comparison of our results with those of recent deep sequencing studies in mouse and human ESC shows that most of the novel/candidate miRNAs found here were not identified in the other studies. The data indicate that hESC express a larger complement of miRNAs than previously appreciated, and they provide a resource for additional studies of miRNA regulation of hESC physiology.
|
34
|
Chiang T, Scholtens D, Sarkar D, Gentleman R, Huber W. Coverage and error models of protein-protein interaction data by directed graph analysis. Genome Biol 2008; 8:R186. [PMID: 17845715 PMCID: PMC2375024 DOI: 10.1186/gb-2007-8-9-r186] [Received: 03/12/2007] [Revised: 05/26/2007] [Accepted: 09/10/2007]
Abstract
Using a directed graph model for bait to prey systems and a multinomial error model, we assessed the error statistics in all published large-scale datasets for Saccharomyces cerevisiae and characterized them by three traits: the set of tested interactions, artifacts that lead to false-positive or false-negative observations, and estimates of the stochastic error rates that affect the data. These traits provide a prerequisite for the estimation of the protein interactome and its modules.
|
35
|
Risk M, Coleman I, Dumpit R, Gentleman R, Kristal AR, Knudsen BS, Nelson PS, Lin DW. Differential gene expression in normal prostate epithelium of men with and without prostate cancer. J Clin Oncol 2008. [DOI: 10.1200/jco.2008.26.15_suppl.5142]
|
36
|
Abstract
We review the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argue that reports of the low quality of such data are to a substantial extent based on misinterpretations. Probabilistic statistical models and methods can be used to estimate properties of interest and to make the best use of the available data.
|
37
|
Abstract
Automated analysis of flow cytometry (FCM) data is essential for it to become successful as a high throughput technology. We believe that the principles of Trellis graphics can be adapted to provide useful visualizations that can aid such automation. In this article, we describe the R/Bioconductor package flowViz that implements such visualizations. AVAILABILITY flowViz is available as an R package from the Bioconductor project: http://bioconductor.org
|
38
|
Scholtens D, Chiang T, Huber W, Gentleman R. Estimating node degree in bait-prey graphs. Bioinformatics 2007; 24:218-24. [PMID: 18025006 DOI: 10.1093/bioinformatics/btm565]
Abstract
MOTIVATION Proteins work together to drive biological processes in cellular machines. Summarizing global and local properties of the set of protein interactions, the interactome, is necessary for describing cellular systems. We consider a relatively simple per-protein feature of the interactome: the number of interaction partners for a protein, which in graph terminology is the degree of the protein. RESULTS Using data subject to both stochastic and systematic sources of false positive and false negative observations, we develop an explicit probability model and resultant likelihood method to estimate node degree on portions of the interactome assayed by bait-prey technologies. This approach yields substantial improvement in degree estimation over the current practice that naively sums observed edges. Accurate modeling of observed data in relation to true but unknown parameters of interest gives a formal point of reference from which to draw conclusions about the system under study. AVAILABILITY All analyses discussed in this text can be performed using the ppiStats and ppiData packages available through the Bioconductor project (http://www.bioconductor.org).
|
39
|
Chiang T, Li N, Orchard S, Kerrien S, Hermjakob H, Gentleman R, Huber W. Rintact: enabling computational analysis of molecular interaction data from the IntAct repository. Bioinformatics 2007; 24:1100-1. [PMID: 17989096 DOI: 10.1093/bioinformatics/btm518]
Abstract
MOTIVATION The IntAct repository is one of the largest and most widely used databases for the curation and storage of molecular interaction data. These datasets need to be analyzed by computational methods. Software packages in the statistical environment R provide powerful tools for conducting such analyses. RESULTS We introduce Rintact, a Bioconductor package that allows users to transform PSI-MI XML2.5 interaction data files from IntAct into R graph objects. On these, they can use methods from R and Bioconductor for a variety of tasks: determining cohesive subgraphs, computing summary statistics, fitting mathematical models to the data or rendering graphical layouts. Rintact provides a programmatic interface to the IntAct repository and allows the use of the analytic methods provided by R and Bioconductor. AVAILABILITY Rintact is freely available at http://bioconductor.org
|
40
|
Abstract
Graph theoretical concepts are useful for the description and analysis of interactions and relationships in biological systems. We give a brief introduction into some of the concepts and their areas of application in molecular biology. We discuss software that is available through the Bioconductor project and present a simple example application to the integration of a protein-protein interaction and a co-expression network.
|
41
|
Le Meur N, Rossini A, Gasparetto M, Smith C, Brinkman RR, Gentleman R. Data quality assessment of ungated flow cytometry data in high throughput experiments. Cytometry A 2007; 71:393-403. [PMID: 17366638 PMCID: PMC2768034 DOI: 10.1002/cyto.a.20396]
Abstract
BACKGROUND The recent development of semiautomated techniques for staining and analyzing flow cytometry samples has presented new challenges. Quality control and quality assessment are critical when developing new high throughput technologies and their associated information services. Our experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display. Especially, data quality control and quality assessment are crucial steps in processing and analyzing high throughput flow cytometry data. METHODS We propose a variety of graphical exploratory data analytic tools for exploring ungated flow cytometry data. We have implemented a number of specialized functions and methods in the Bioconductor package rflowcyt. We demonstrate the use of these approaches by investigating two independent sets of high throughput flow cytometry data. RESULTS We found that graphical representations can reveal substantial nonbiological differences in samples. Empirical Cumulative Distribution Function and summary scatterplots were especially useful in the rapid identification of problems not identified by manual review. CONCLUSIONS Graphical exploratory data analytic tools are quick and useful means of assessing data quality. We propose that the described visualizations should be used as quality assessment tools and where possible, be used for quality control.
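As a rough illustration of the empirical cumulative distribution function (ECDF) comparison mentioned in this abstract, the following Python sketch computes an ECDF for one channel of a sample and measures its largest vertical gap from a reference sample. It is not the rflowcyt implementation: the function names, the gap statistic, and the toy intensity values are all assumptions made for this example.

```python
from bisect import bisect_right


def ecdf(values):
    """Return F, where F(x) is the fraction of values that are <= x."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("ecdf needs at least one value")

    def F(x):
        return bisect_right(xs, x) / n

    return F


def max_ecdf_gap(sample, reference):
    """Largest vertical distance between the two ECDFs (hypothetical QA score)."""
    F, G = ecdf(sample), ecdf(reference)
    # Both ECDFs are step functions, so the maximum gap occurs at an observed value.
    grid = sorted(set(sample) | set(reference))
    return max(abs(F(x) - G(x)) for x in grid)


# Toy fluorescence intensities for three wells (made-up numbers).
reference_well = [10, 12, 15, 18, 20, 22]
similar_well = [11, 12, 14, 18, 21, 22]
shifted_well = [40, 42, 45, 48, 50, 52]

gap_similar = max_ecdf_gap(similar_well, reference_well)
gap_shifted = max_ecdf_gap(shifted_well, reference_well)
```

A well whose gap is much larger than that of its neighbours, such as `shifted_well` here, would be a candidate for closer inspection; any cut-off for "much larger" would have to be chosen for the experiment at hand.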
|
42
|
|
43
|
Carey VJ, Morgan M, Falcon S, Lazarus R, Gentleman R. GGtools: analysis of genetics of gene expression in bioconductor. Bioinformatics 2006; 23:522-3. [PMID: 17158513 DOI: 10.1093/bioinformatics/btl628]
Abstract
This paper reviews the central concepts and implementation of data structures and methods for studying genetics of gene expression with the GGtools package of Bioconductor. Illustration with a HapMap+expression dataset is provided. AVAILABILITY Package GGtools is part of Bioconductor 1.9 (http://bioconductor.org). Open source with Artistic License.
|
44
|
Abstract
MOTIVATION Gene Set Enrichment Analysis (GSEA) has been developed recently to capture changes in the expression of pre-defined sets of genes. We propose a number of extensions to GSEA, including the use of different statistics to describe the association between genes and phenotypes of interest. We make use of dimension reduction procedures, such as principal component analysis, to identify gene sets with correlated expression. We also address issues that arise when gene sets overlap. RESULTS Our proposals extend the range of applicability of GSEA and allow for adjustments based on other covariates. We have provided a well-defined procedure to address interpretation issues that can arise when gene sets have substantial overlap. We have shown how standard dimension reduction methods, such as PCA, can be used to help further interpret GSEA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
45
|
Abstract
MOTIVATION Functional analyses based on the association of Gene Ontology (GO) terms to genes in a selected gene list are useful bioinformatic tools and the GOstats package has been widely used to perform such computations. In this paper we report significant improvements and extensions such as support for conditional testing. RESULTS We discuss the capabilities of GOstats, a Bioconductor package written in R, that allows users to test GO terms for over- or under-representation using either a classical hypergeometric test or a conditional hypergeometric test that uses the relationships among GO terms to decorrelate the results. AVAILABILITY GOstats is available as an R package from the Bioconductor project: http://bioconductor.org
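To make the classical hypergeometric test named in this abstract concrete, here is a minimal Python sketch of the over-representation p-value for a single GO term. It is not the GOstats API and does not attempt the conditional test; the function name, argument names, and the gene counts in the usage lines are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """Return P(X >= k) for a hypergeometric X.

    N: genes in the universe
    K: universe genes annotated to the GO term
    n: genes in the selected list
    k: selected genes annotated to the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    lo = max(k, 0)
    hi = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 1000 genes in the universe, 40 annotated to the term,
# 50 selected genes, 8 of which carry the annotation (2 expected by chance).
p_over = hypergeom_upper_tail(8, 40, 50, 1000)
```

Under-representation would use the lower tail, P(X <= k), instead. Exact integer arithmetic via `math.comb` avoids rounding problems for moderately sized gene universes, at the cost of speed on very large ones.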
|
46
|
Shi Q, Harris LN, Lu X, Li X, Hwang J, Gentleman R, Iglehart JD, Miron A. Declining Plasma Fibrinogen Alpha Fragment Identifies HER2-Positive Breast Cancer Patients and Reverts to Normal Levels after Surgery. J Proteome Res 2006; 5:2947-55. [PMID: 17081046 DOI: 10.1021/pr060099u]
Abstract
Breast cancer is the most common nonskin malignancy affecting women. Currently, no simple, blood-based diagnostic test exists to complement radiological screening and increase sensitivity of detection. To screen plasma specimens and identify biomarkers that detect HER2-positive breast cancer, automated robotic sample processing followed by surface-enhanced laser desorption ionization time-of-flight (SELDI-TOF) mass spectroscopy was used. Multiple statistical algorithms were used to select biomarkers that segregate cancer patients versus controls and produced average CV rates ranging from 20% to 29%. A set of seven biomarkers were validated on an independent test data set and achieved the best error rate of 19.1%. A permutation test indicated a p-value for CV error less than 0.002. Moreover, a ROC curve using these biomarkers achieved an area-under-the-curve value of 0.95 on an independent test data set. The marker responsible for most of the resolving power was identified as a fragment of Fibrinogen Alpha (FGA) encompassing residues 605-629. This marker was present at lower levels in cancer patients as compared to controls. The importance of this biomarker was validated in a longitudinal study comparing pre- and post-operative levels and was shown to revert to normal levels after surgery. This fragment may serve as a useful diagnostic and treatment-monitoring marker.
|
47
|
Quackenbush J, Stoeckert C, Ball C, Brazma A, Gentleman R, Huber W, Irizarry R, Salit M, Sherlock G, Spellman P, Winegarden N. Top-down standards will not serve systems biology. Nature 2006; 440:24. [PMID: 16511469 DOI: 10.1038/440024a]
|
48
|
Gentleman R. Developing Statistical Software in FORTRAN 95. J Stat Softw 2006. [DOI: 10.18637/jss.v017.b02]
|
49
|
Chiaretti S, Li X, Gentleman R, Vitale A, Wang KS, Mandelli F, Foà R, Ritz J. Gene Expression Profiles of B-lineage Adult Acute Lymphocytic Leukemia Reveal Genetic Patterns that Identify Lineage Derivation and Distinct Mechanisms of Transformation. Clin Cancer Res 2005; 11:7209-19. [PMID: 16243790 DOI: 10.1158/1078-0432.ccr-04-2165]
Abstract
PURPOSE To characterize gene expression signatures in acute lymphocytic leukemia (ALL) cells associated with known genotypic abnormalities in adult patients. EXPERIMENTAL DESIGN Gene expression profiles from 128 adult patients with newly diagnosed ALL were characterized using high-density oligonucleotide microarrays. All patients were enrolled in the Italian GIMEMA multicenter clinical trial 0496 and samples had >90% leukemic cells. Uniform phenotypic, cytogenetic, and molecular data were also available for all cases. RESULTS T-lineage ALL was characterized by a homogeneous gene expression pattern, whereas several subgroups of B-lineage ALL were evident. Within B-lineage ALL, distinct signatures were associated with ALL1/AF4 and E2A/PBX1 gene rearrangements. Expression profiles associated with ALL1/AF4 and E2A/PBX1 are similar in adults and children. BCR/ABL+ gene expression pattern was more heterogeneous and was most similar to ALL without known molecular rearrangements. We also identified a set of 83 genes that were highly expressed in leukemia blasts from patients without known molecular abnormalities who subsequently relapsed following therapy. Supervised analysis of kinase genes revealed a high-level FLT3 expression in a subset of cases without molecular rearrangements. Two other kinases (PRKCB1 and DDR1) were highly expressed in cases without molecular rearrangements, as well as in BCR/ABL-positive ALL. CONCLUSIONS Genomic signatures are associated with phenotypically and molecularly well defined subgroups of adult ALL. Genomic profiling also identifies genes associated with poor outcome in cases without molecular aberrations and specific genes that may be new therapeutic targets in adult ALL.
|
50
|
Tadesse MG, Ibrahim JG, Gentleman R, Chiaretti S, Ritz J, Foa R. Bayesian error-in-variable survival model for the analysis of GeneChip arrays. Biometrics 2005; 61:488-97. [PMID: 16011696 DOI: 10.1111/j.1541-0420.2005.00313.x]
Abstract
DNA microarrays in conjunction with statistical models may help gain a deeper understanding of the molecular basis for specific diseases. An intense area of research is concerned with the identification of genes related to particular phenotypes. The technology, however, is subject to various sources of error that may lead to expression readings that are substantially different from the true transcript levels. Few methods for microarray data analysis have accounted for measurement error in a substantial way and that is the purpose of this investigation. We describe a Bayesian error-in-variable model for the analysis of microarray data from a clinical study of patients with acute lymphoblastic leukemia. We focus in particular on the problem of identifying genes whose expression patterns are associated with duration of remission. This is a question of great practical interest since relapse is a major concern in the treatment of this disease. We explore the effects of ignoring the uncertainty in the expression estimates on the selection and ranking of genes.
|