1
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4:249-64. [PMID: 12925520 DOI: 10.1093/biostatistics/4.2.249] [Citation(s) in RCA: 8245] [Impact Index Per Article: 374.8] [Indexed: 01/13/2023]
Abstract
In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip arrays; part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth's Genetics Institute involving 95 HG-U95A human GeneChip arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance-mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix's (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.
2
Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003; 19:185-93. [PMID: 12538238 DOI: 10.1093/bioinformatics/19.2.185] [Citation(s) in RCA: 6152] [Impact Index Per Article: 279.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION When running experiments that involve multiple high density oligonucleotide arrays, it is important to remove sources of variation between arrays of non-biological origin. Normalization is a process for reducing this variation. It is common to see non-linear relations between arrays and the standard normalization provided by Affymetrix does not perform well in these situations. RESULTS We present three methods of performing normalization at the probe intensity level. These methods are called complete data methods because they make use of data from all arrays in an experiment to form the normalizing relation. These algorithms are compared to two methods that make use of a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation by comparing the variability and bias of an expression measure. Two publicly available datasets are used to carry out the comparisons. The simplest and quickest complete data method is found to perform favorably. AVAILABILITY Software implementing all three of the complete data normalization methods is available as part of the R package Affy, which is a part of the Bioconductor project http://www.bioconductor.org. SUPPLEMENTARY INFORMATION Additional figures may be found at http://www.stat.berkeley.edu/~bolstad/normalize/index.html
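The abstract above does not name its three complete data methods. As an illustration only, the sketch below assumes quantile normalization, one well-known method of this kind that uses all arrays to form the normalizing relation. The function name and the tie handling (ties broken by position) are assumptions made for this example, not the affy package's implementation.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays: list of equal-length lists of probe intensities, one list per array.
    Ties within an array are broken by position, which is a simplification.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[i] for s in sorted_arrays) for i in range(n)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value by the cross-array mean at its rank.
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After this step every array contains the same set of values, so between-array differences in the overall intensity distribution are removed while the within-array ordering of probes is preserved.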
Publication type: Comparative Study
3
Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O'Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010; 17:98-110. [PMID: 20129251 PMCID: PMC2818769 DOI: 10.1016/j.ccr.2009.12.020] [Citation(s) in RCA: 5499] [Impact Index Per Article: 366.6] [Received: 02/02/2009] [Revised: 09/03/2009] [Accepted: 12/04/2009] [Indexed: 12/11/2022]
Abstract
The Cancer Genome Atlas Network recently cataloged recurrent genomic abnormalities in glioblastoma multiforme (GBM). We describe a robust gene expression-based molecular classification of GBM into Proneural, Neural, Classical, and Mesenchymal subtypes and integrate multidimensional genomic data to establish patterns of somatic mutations and DNA copy number. Aberrations and gene expression of EGFR, NF1, and PDGFRA/IDH1 each define the Classical, Mesenchymal, and Proneural subtypes, respectively. Gene signatures of normal brain cell types show a strong relationship between subtypes and different neural lineages. Additionally, response to aggressive therapy differs by subtype, with the greatest benefit in the Classical subtype and no benefit in the Proneural subtype. We provide a framework that unifies transcriptomic and genomic dimensions for GBM molecular stratification with important implications for future studies.
Publication type: Research Support, N.I.H., Extramural
4
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003; 31:e15. [PMID: 12582260 PMCID: PMC150247 DOI: 10.1093/nar/gng015] [Citation(s) in RCA: 3898] [Impact Index Per Article: 177.2] [Indexed: 12/12/2022]
Abstract
High density oligonucleotide array technology is widely used in many areas of biomedical research for quantitative and highly parallel measurements of gene expression. Affymetrix GeneChip arrays are the most popular. In this technology each gene is typically represented by a set of 11-20 pairs of probes. In order to obtain expression measures it is necessary to summarize the probe level data. Using two extensive spike-in studies and a dilution study, we developed a set of tools for assessing the effectiveness of expression measures. We found that the performance of the current version of the default expression measure provided by Affymetrix Microarray Suite can be significantly improved by the use of probe level summaries derived from empirically motivated statistical models. In particular, improvements in the ability to detect differentially expressed genes are demonstrated.
Publication type: research-article
5
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002; 30:e15. [PMID: 11842121 PMCID: PMC100354 DOI: 10.1093/nar/30.4.e15] [Citation(s) in RCA: 2433] [Impact Index Per Article: 105.8] [Indexed: 11/13/2022]
Abstract
There are many sources of systematic variation in cDNA microarray experiments which affect the measured gene expression levels (e.g. differences in labeling efficiency between the two fluorescent dyes). The term normalization refers to the process of removing such variation. A constant adjustment is often used to force the distribution of the intensity log ratios to have a median of zero for each slide. However, such global normalization approaches are not adequate in situations where dye biases can depend on spot overall intensity and/or spatial location within the array. This article proposes normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments. The selection of appropriate controls for normalization is discussed and a novel set of controls (microarray sample pool, MSP) is introduced to aid in intensity-dependent normalization. Lastly, to allow for comparisons of expression levels across slides, a robust method based on maximum likelihood estimation is proposed to adjust for scale differences among slides.
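The constant adjustment mentioned in the abstract above, which forces the log ratios on a slide to have a median of zero, can be sketched as follows. This shows only the global baseline that the article argues is inadequate when dye bias depends on intensity or location; it is not the article's local regression method. The function name and the two-channel inputs `red` and `green` are hypothetical.

```python
import math
from statistics import median


def median_center_log_ratios(red, green):
    """Global normalization for one slide.

    red, green: positive intensities for the two dyes, one pair per spot.
    Returns log2(red/green) shifted so that the slide median is zero.
    """
    if len(red) != len(green):
        raise ValueError("both channels must have one value per spot")
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)  # a single constant for the whole slide
    return [m - shift for m in log_ratios]


if __name__ == "__main__":
    print(median_center_log_ratios([2.0, 4.0, 8.0], [1.0, 1.0, 1.0]))
```

Because a single constant is subtracted from every spot, any bias that varies with spot intensity or position on the array is left untouched, which is the limitation the article addresses.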
Publication type: research-article
6
Dudoit S, Fridlyand J, Speed TP. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. J Am Stat Assoc 2002. [DOI: 10.1198/016214502753479248] [Citation(s) in RCA: 1691] [Impact Index Per Article: 73.5] [Indexed: 11/21/2022]
7
Beissbarth T, Speed TP. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004; 20:1464-5. [PMID: 14962934 DOI: 10.1093/bioinformatics/bth088] [Citation(s) in RCA: 933] [Impact Index Per Article: 44.4] [Indexed: 11/14/2022]
Abstract
SUMMARY Modern experimental techniques, such as DNA microarrays, usually produce a long list of genes that are potentially interesting in the analyzed process. In order to gain biological understanding from this type of data, it is necessary to analyze the functional annotations of all genes in this list. The Gene-Ontology (GO) database provides a useful tool to annotate and analyze the functions of a large number of genes. Here, we introduce a tool that utilizes this information to obtain an understanding of which annotations are typical for the analyzed list of genes. This program automatically obtains the GO annotations from a database and generates statistics of which annotations are overrepresented in the analyzed list of genes. This results in a list of GO terms sorted by their specificity. AVAILABILITY Our program GOstat is accessible via the Internet at http://gostat.wehi.edu.au
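The abstract above does not say which statistic GOstat computes. One common way to score overrepresentation of an annotation in a gene list is the upper tail of a hypergeometric distribution, and the sketch below assumes that choice purely for illustration. The function name and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N: genes in the reference set, K: reference genes carrying the GO term,
    n: genes in the analyzed list, k: listed genes carrying the GO term.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probabilities of k, k+1, ..., upper annotated genes in the list.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


if __name__ == "__main__":
    # 3 of 10 reference genes carry the term; all 3 listed genes carry it.
    print(hypergeom_upper_tail(3, 3, 3, 10))
```

A small tail probability suggests the term appears in the list more often than expected by chance. When many GO terms are tested, some form of multiple-testing correction would normally follow.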
Publication type: Research Support, Non-U.S. Gov't
8
Savas P, Virassamy B, Ye C, Salim A, Mintoff CP, Caramia F, Salgado R, Byrne DJ, Teo ZL, Dushyanthen S, Byrne A, Wein L, Luen SJ, Poliness C, Nightingale SS, Skandarajah AS, Gyorki DE, Thornton CM, Beavis PA, Fox SB, Darcy PK, Speed TP, Mackay LK, Neeson PJ, Loi S. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat Med 2018; 24:986-993. [PMID: 29942092 DOI: 10.1038/s41591-018-0078-7] [Citation(s) in RCA: 689] [Impact Index Per Article: 98.4] [Received: 11/28/2017] [Accepted: 04/25/2018] [Indexed: 12/18/2022]
Abstract
The quantity of tumor-infiltrating lymphocytes (TILs) in breast cancer (BC) is a robust prognostic factor for improved patient survival, particularly in triple-negative and HER2-overexpressing BC subtypes. Although T cells are the predominant TIL population, the relationship between quantitative and qualitative differences in T cell subpopulations and patient prognosis remains unknown. We performed single-cell RNA sequencing (scRNA-seq) of 6,311 T cells isolated from human BCs and show that significant heterogeneity exists in the infiltrating T cell population. We demonstrate that BCs with a high number of TILs contained CD8+ T cells with features of tissue-resident memory T (TRM) cell differentiation and that these CD8+ TRM cells expressed high levels of immune checkpoint molecules and effector proteins. A CD8+ TRM gene signature developed from the scRNA-seq data was significantly associated with improved patient survival in early-stage triple-negative breast cancer (TNBC) and provided better prognostication than CD8 expression alone. Our data suggest that CD8+ TRM cells contribute to BC immunosurveillance and are the key targets of modulation by immune checkpoint inhibition. Further understanding of the development, maintenance and regulation of TRM cells will be crucial for successful immunotherapeutic development in BC.
Publication type: Research Support, Non-U.S. Gov't
9
Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 2012; 40:e72. [PMID: 22323520 PMCID: PMC3378858 DOI: 10.1093/nar/gks001] [Citation(s) in RCA: 580] [Impact Index Per Article: 44.6] [Indexed: 11/21/2022]
Abstract
GC content bias describes the dependence between fragment count (read coverage) and GC content found in Illumina sequencing data. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation (DNA-seq). The bias is not consistent between samples; and there is no consensus as to the best methods to remove it in a single sample. We analyze regularities in the GC bias patterns, and find a compact description for this unimodal curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are underrepresented in the sequencing results. This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias. We propose a model that produces predictions at the base pair level, allowing strand-specific GC-effect correction regardless of the downstream smoothing or binning. These GC modeling considerations can inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq.
Publication type: Research Support, U.S. Gov't, Non-P.H.S.
10
Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Duke S, Garber M, Gentles AJ, Goodstadt L, Heger A, Jurka J, Kamal M, Mauceli E, Searle SMJ, Sharpe T, Baker ML, Batzer MA, Benos PV, Belov K, Clamp M, Cook A, Cuff J, Das R, Davidow L, Deakin JE, Fazzari MJ, Glass JL, Grabherr M, Greally JM, Gu W, Hore TA, Huttley GA, Kleber M, Jirtle RL, Koina E, Lee JT, Mahony S, Marra MA, Miller RD, Nicholls RD, Oda M, Papenfuss AT, Parra ZE, Pollock DD, Ray DA, Schein JE, Speed TP, Thompson K, VandeBerg JL, Wade CM, Walker JA, Waters PD, Webber C, Weidman JR, Xie X, Zody MC, Graves JAM, Ponting CP, Breen M, Samollow PB, Lander ES, Lindblad-Toh K. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 2007; 447:167-77. [PMID: 17495919 DOI: 10.1038/nature05805] [Citation(s) in RCA: 520] [Impact Index Per Article: 28.9] [Received: 12/05/2006] [Accepted: 04/03/2007] [Indexed: 12/15/2022]
Abstract
We report a high-quality draft of the genome sequence of the grey, short-tailed opossum (Monodelphis domestica). As the first metatherian ('marsupial') species to be sequenced, the opossum provides a unique perspective on the organization and evolution of mammalian genomes. Distinctive features of the opossum chromosomes provide support for recent theories about genome evolution and function, including a strong influence of biased gene conversion on nucleotide sequence composition, and a relationship between chromosomal characteristics and X chromosome inactivation. Comparison of opossum and eutherian genomes also reveals a sharp difference in evolutionary innovation between protein-coding and non-coding functional elements. True innovation in protein-coding genes seems to be relatively rare, with lineage-specific differences being largely due to diversification and rapid turnover in gene families involved in environmental interactions. In contrast, about 20% of eutherian conserved non-coding elements (CNEs) are recent inventions that postdate the divergence of Eutheria and Metatheria. A substantial proportion of these eutherian-specific CNEs arose from sequence inserted by transposable elements, pointing to transposons as a major creative force in the evolution of mammalian gene regulation.
11
Li XY, MacArthur S, Bourgon R, Nix D, Pollard DA, Iyer VN, Hechmer A, Simirenko L, Stapleton M, Hendriks CLL, Chu HC, Ogawa N, Inwood W, Sementchenko V, Beaton A, Weiszmann R, Celniker SE, Knowles DW, Gingeras T, Speed TP, Eisen MB, Biggin MD. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol 2008; 6:e27. [PMID: 18271625 PMCID: PMC2235902 DOI: 10.1371/journal.pbio.0060027] [Citation(s) in RCA: 363] [Impact Index Per Article: 21.4] [Received: 08/28/2007] [Accepted: 12/19/2007] [Indexed: 01/22/2023]
Abstract
Identifying the genomic regions bound by sequence-specific regulatory factors is central both to deciphering the complex DNA cis-regulatory code that controls transcription in metazoans and to determining the range of genes that shape animal morphogenesis. We used whole-genome tiling arrays to map sequences bound in Drosophila melanogaster embryos by the six maternal and gap transcription factors that initiate anterior-posterior patterning. We find that these sequence-specific DNA binding proteins bind with quantitatively different specificities to highly overlapping sets of several thousand genomic regions in blastoderm embryos. Specific high- and moderate-affinity in vitro recognition sequences for each factor are enriched in bound regions. This enrichment, however, is not sufficient to explain the pattern of binding in vivo and varies in a context-dependent manner, demonstrating that higher-order rules must govern targeting of transcription factors. The more highly bound regions include all of the over 40 well-characterized enhancers known to respond to these factors as well as several hundred putative new cis-regulatory modules clustered near developmental regulators and other genes with patterned expression at this stage of embryogenesis. The new targets include most of the microRNAs (miRNAs) transcribed in the blastoderm, as well as all major zygotically transcribed dorsal-ventral patterning genes, whose expression we show to be quantitatively modulated by anterior-posterior factors. In addition to these highly bound regions, there are several thousand regions that are reproducibly bound at lower levels. However, these poorly bound regions are, collectively, far more distant from genes transcribed in the blastoderm than highly bound regions; are preferentially found in protein-coding sequences; and are less conserved than highly bound regions. 
Together these observations suggest that many of these poorly bound regions are not involved in early-embryonic transcriptional regulation, and a significant proportion may be nonfunctional. Surprisingly, for five of the six factors, their recognition sites are not unambiguously more constrained evolutionarily than the immediate flanking DNA, even in more highly bound and presumably functional regions, indicating that comparative DNA sequence analysis is limited in its ability to identify functional transcription factor targets.
Publication type: Research Support, U.S. Gov't, Non-P.H.S.
12
Sargeant TJ, Marti M, Caler E, Carlton JM, Simpson K, Speed TP, Cowman AF. Lineage-specific expansion of proteins exported to erythrocytes in malaria parasites. Genome Biol 2006; 7:R12. [PMID: 16507167 PMCID: PMC1431722 DOI: 10.1186/gb-2006-7-2-r12] [Citation(s) in RCA: 330] [Impact Index Per Article: 17.4] [Received: 10/24/2005] [Revised: 12/20/2005] [Accepted: 01/23/2006] [Indexed: 11/23/2022]
Abstract
New software was used to predict exported proteins that are conserved between malaria parasites infecting rodents and those infecting humans, revealing a lineage-specific expansion of exported proteins. BACKGROUND The apicomplexan parasite Plasmodium falciparum causes the most severe form of malaria in humans. After invasion into erythrocytes, asexual parasite stages drastically alter their host cell and export remodeling and virulence proteins. Previously, we have reported identification and functional analysis of a short motif necessary for export of proteins out of the parasite and into the red blood cell. RESULTS We have developed software for the prediction of exported proteins in the genus Plasmodium, and identified exported proteins conserved between malaria parasites infecting rodents and the two major causes of human malaria, P. falciparum and P. vivax. This conserved 'exportome' is confined to a few subtelomeric chromosomal regions in P. falciparum and the synteny of these and surrounding regions is conserved in P. vivax. We have identified a novel gene family PHIST (for Plasmodium helical interspersed subtelomeric family) that shares a unique domain with 72 paralogs in P. falciparum and 39 in P. vivax; however, there is only one member in each of the three species studied from the P. berghei lineage. CONCLUSION These data suggest radiation of genes encoding remodeling and virulence factors from a small number of loci in a common Plasmodium ancestor, and imply a closer phylogenetic relationship between the P. vivax and P. falciparum lineages than previously believed. The presence of a conserved 'exportome' in the genus Plasmodium has important implications for our understanding of both common mechanisms and species-specific differences in host-parasite interactions, and may be crucial in developing novel antimalarial drugs against this infectious disease.
Publication type: Research Support, Non-U.S. Gov't
13
Abstract
MOTIVATION A classification algorithm, based on a multi-chip, multi-SNP approach is proposed for Affymetrix SNP arrays. Current procedures for calling genotypes on SNP arrays process all the features associated with one chip and one SNP at a time. Using a large training sample where the genotype labels are known, we develop a supervised learning algorithm to obtain more accurate classification results on new data. The method we propose, RLMM, is based on a robustly fitted, linear model and uses the Mahalanobis distance for classification. The chip-to-chip non-biological variance is reduced through normalization. This model-based algorithm captures the similarities across genotype groups and probes, as well as across thousands of SNPs for accurate classification. In this paper, we apply RLMM to Affymetrix 100 K SNP array data, present classification results and compare them with genotype calls obtained from the Affymetrix procedure DM, as well as to the publicly available genotype calls from the HapMap project.
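The classification step described above, assigning a genotype by Mahalanobis distance, can be illustrated with a two-dimensional toy example. This is a sketch under stated assumptions, not RLMM itself: the cluster means and covariances below are invented, whereas the paper estimates them from training data with a robustly fitted linear model. The function names and the genotype labels AA, AB and BB are hypothetical.

```python
def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a cluster.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of ((a, b), (c, d)) is ((d, -b), (-c, a)) / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def classify(x, clusters):
    """Return the label of the cluster nearest to x in Mahalanobis distance.

    clusters maps a genotype label to a (mean, covariance) pair.
    """
    return min(clusters, key=lambda label: mahalanobis_sq(x, *clusters[label]))


if __name__ == "__main__":
    identity = ((1.0, 0.0), (0.0, 1.0))
    # Invented cluster parameters for the two allele intensity summaries.
    clusters = {
        "AA": ((2.0, 0.0), identity),
        "AB": ((1.0, 1.0), identity),
        "BB": ((0.0, 2.0), identity),
    }
    print(classify((1.9, 0.2), clusters))
```

Unlike Euclidean distance, the Mahalanobis distance scales each direction by the cluster's covariance, so a point is judged relative to the spread and orientation of each genotype cluster.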
|
|
20 |
287 |
14
|
Dugas JC, Tai YC, Speed TP, Ngai J, Barres BA. Functional genomic analysis of oligodendrocyte differentiation. J Neurosci 2006; 26:10967-83. [PMID: 17065439 PMCID: PMC6674672 DOI: 10.1523/jneurosci.2572-06.2006] [Citation(s) in RCA: 264] [Impact Index Per Article: 13.9] [Indexed: 12/18/2022]
Abstract
To better understand the molecular mechanisms governing oligodendrocyte (OL) differentiation, we have used gene profiling to quantitatively analyze gene expression in synchronously differentiating OLs generated from pure oligodendrocyte precursor cells in vitro. By comparing gene expression in these OLs to OLs generated in vivo, we discovered that the program of OL differentiation can progress normally in the absence of heterologous cell-cell interactions. In addition, we found that OL differentiation was unexpectedly prolonged and occurred in at least two sequential stages, each characterized by changes in distinct complements of transcription factors and myelin proteins. By disrupting the normal dynamic expression patterns of transcription factors regulated during OL differentiation, we demonstrated that these sequential stages of gene expression can be independently controlled. We also uncovered several genes previously uncharacterized in OLs that encode transmembrane, secreted, and cytoskeletal proteins that are as highly upregulated as myelin genes during OL differentiation. Last, by comparing genomic loci associated with inherited increased risk of multiple sclerosis (MS) to genes regulated during OL differentiation, we identified several new positional candidate genes that may contribute to MS susceptibility. These findings reveal a previously unexpected complexity to OL differentiation and suggest that an intrinsic program governs successive phases of OL differentiation as these cells extend and align their processes, ensheathe, and ultimately myelinate axons.
Publication type: Research Support, Non-U.S. Gov't
15
Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics 2011; 13:539-52. [PMID: 22101192 DOI: 10.1093/biostatistics/kxr034] [Citation(s) in RCA: 259] [Impact Index Per Article: 18.5] [Indexed: 11/14/2022]
Abstract
Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.
Publication type: Research Support, U.S. Gov't, Non-P.H.S.
16
Broman KW, Speed TP. A model selection approach for the identification of quantitative trait loci in experimental crosses. J R Stat Soc Series B Stat Methodol 2002. [DOI: 10.1111/1467-9868.00354] [Citation(s) in RCA: 258] [Impact Index Per Article: 11.2] [Indexed: 11/30/2022]
17
Yang YH, Buckley MJ, Dudoit S, Speed TP. Comparison of Methods for Image Analysis on cDNA Microarray Data. J Comput Graph Stat 2002. [DOI: 10.1198/106186002317375640] [Citation(s) in RCA: 225] [Impact Index Per Article: 9.8] [Indexed: 11/21/2022]
18
Gilad Y, Oshlack A, Smyth GK, Speed TP, White KP. Expression profiling in primates reveals a rapid evolution of human transcription factors. Nature 2006; 440:242-5. [PMID: 16525476 DOI: 10.1038/nature04559] [Citation(s) in RCA: 218] [Impact Index Per Article: 11.5] [Received: 11/05/2005] [Accepted: 12/29/2005] [Indexed: 12/13/2022]
Abstract
Although it has been hypothesized for thirty years that many human adaptations are likely to be due to changes in gene regulation, almost nothing is known about the modes of natural selection acting on regulation in primates. Here we identify a set of genes for which expression is evolving under natural selection. We use a new multi-species complementary DNA array to compare steady-state messenger RNA levels in liver tissues within and between humans, chimpanzees, orangutans and rhesus macaques. Using estimates from a linear mixed model, we identify a set of genes for which expression levels have remained constant across the entire phylogeny (approximately 70 million years), and are therefore likely to be under stabilizing selection. Among the top candidates are five genes with expression levels that have previously been shown to be altered in liver carcinoma. We also find a number of genes with similar expression levels among non-human primates but significantly elevated or reduced expression in the human lineage, features that point to the action of directional selection. Among the gene set with a human-specific increase in expression, there is an excess of transcription factors; the same is not true for genes with increased expression in chimpanzee.
Publication type: Research Support, Non-U.S. Gov't
19
Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004; 20:323-31. [PMID: 14960458 DOI: 10.1093/bioinformatics/btg410] [Citation(s) in RCA: 214] [Impact Index Per Article: 10.2] [Indexed: 11/15/2022]
Abstract
MOTIVATION: The defining feature of oligonucleotide expression arrays is the use of several probes to assay each targeted transcript. This is a bonanza for the statistical geneticist, who can create probeset summaries with specific characteristics. There are now several methods available for summarizing probe level data from the popular Affymetrix GeneChips, but it is difficult to identify the best method for a given inquiry. RESULTS: We have developed a graphical tool to evaluate summaries of Affymetrix probe level data. Plots and summary statistics offer a picture of how an expression measure performs in several important areas. This picture facilitates the comparison of competing expression measures and the selection of methods suitable for a specific investigation. The key is a benchmark data set consisting of a dilution study and a spike-in study. Because the truth is known for these data, we can identify statistical features of the data for which the expected outcome is known in advance. Those features highlighted in our suite of graphs are justified by questions of biological interest and motivated by the presence of appropriate data.
|
|
21 |
214 |
20
|
Bastacky J, Lee CY, Goerke J, Koushafar H, Yager D, Kenaga L, Speed TP, Chen Y, Clements JA. Alveolar lining layer is thin and continuous: low-temperature scanning electron microscopy of rat lung. J Appl Physiol (1985) 1995; 79:1615-28. [PMID: 8594022 DOI: 10.1152/jappl.1995.79.5.1615] [Citation(s) in RCA: 206] [Impact Index Per Article: 6.9] [Indexed: 01/31/2023] Open
Abstract
The low-temperature electron microscope, which preserves aqueous structures as solid water at liquid nitrogen temperature, was used to image the alveolar lining layer, including surfactant and its aqueous subphase, of air-filled lungs frozen in anesthetized rats at 15-cmH2O transpulmonary pressure. Lining layer thickness was measured on cross fractures of walls of the outermost subpleural alveoli that could be solidified with metal mirror cryofixation at rates sufficient to limit ice crystal growth to 10 nm and prevent appreciable water movement. The thickness of the liquid layer averaged 0.14 micron over relatively flat portions of the alveolar walls, 0.89 micron at the alveolar wall junctions, and 0.09 micron over the protruding features (9 rats, 20 walls, 16 junctions, and 146 areas), for an area-weighted average thickness of 0.2 micron. The alveolar lining layer appears continuous, submerging epithelial cell microvilli and intercellular junctional ridges; varies from a few nanometers to several micrometers in thickness, and serves to smooth the alveolar air-liquid interface in lungs inflated to zone 1 or 2 conditions.
|
|
30 |
206 |
21
|
Gilson PR, Nebl T, Vukcevic D, Moritz RL, Sargeant T, Speed TP, Schofield L, Crabb BS. Identification and stoichiometry of glycosylphosphatidylinositol-anchored membrane proteins of the human malaria parasite Plasmodium falciparum. Mol Cell Proteomics 2006; 5:1286-99. [PMID: 16603573 DOI: 10.1074/mcp.m600035-mcp200] [Citation(s) in RCA: 199] [Impact Index Per Article: 10.5] [Indexed: 11/06/2022] Open
Abstract
Most proteins that coat the surface of the extracellular forms of the human malaria parasite Plasmodium falciparum are attached to the plasma membrane via glycosylphosphatidylinositol (GPI) anchors. These proteins are exposed to neutralizing antibodies, and several are advanced vaccine candidates. To identify the GPI-anchored proteome of P. falciparum we used a combination of proteomic and computational approaches. Focusing on the clinically relevant blood stage of the life cycle, proteomic analysis of proteins labeled with radioactive glucosamine identified GPI anchoring on 11 proteins (merozoite surface protein (MSP)-1, -2, -4, -5, -10, rhoptry-associated membrane antigen, apical sushi protein, Pf92, Pf38, Pf12, and Pf34). These proteins represent approximately 94% of the GPI-anchored schizont/merozoite proteome and constitute by far the largest validated set of GPI-anchored proteins in this organism. Moreover MSP-1 and MSP-2 were present in similar copy number, and we estimated that together these proteins comprise approximately two-thirds of the total membrane-associated surface coat. This is the first time the stoichiometry of MSPs has been examined. We observed that available software performed poorly in predicting GPI anchoring on P. falciparum proteins where such modification had been validated by proteomics. Therefore, we developed a hidden Markov model (GPI-HMM) trained on P. falciparum sequences and used this to rank all proteins encoded in the completed P. falciparum genome according to their likelihood of being GPI-anchored. GPI-HMM predicted GPI modification on all validated proteins, on several known membrane proteins, and on a number of novel, presumably surface, proteins expressed in the blood, insect, and/or pre-erythrocytic stages of the life cycle. Together this work identified 11 and predicted a further 19 GPI-anchored proteins in P. falciparum.
|
Research Support, Non-U.S. Gov't |
19 |
199 |
22
|
Kapp EA, Schütz F, Reid GE, Eddes JS, Moritz RL, O'Hair RAJ, Speed TP, Simpson RJ. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal Chem 2004; 75:6251-64. [PMID: 14616009 DOI: 10.1021/ac034616t] [Citation(s) in RCA: 191] [Impact Index Per Article: 9.1] [Indexed: 12/21/2022]
Abstract
A database of 5500 unique peptide tandem mass spectra acquired in an ion trap mass spectrometer was assembled for peptides derived from proteins digested with trypsin. Peptides were identified initially from their tandem mass spectra by the SEQUEST algorithm and subsequently validated manually. Two different statistical methods were used to identify sequence-dependent fragmentation patterns that could be used to improve fragmentation models incorporated into current peptide sequencing and database search algorithms. The currently accepted "mobile proton" model was expanded to derive a new classification scheme for peptide mass spectra, the "relative proton mobility" scale, which considers peptide ion charge state and amino acid composition to categorize peptide mass spectra into peptide ions containing "nonmobile", "partially mobile", or "mobile" protons. Quantitation of amide bond fragmentation, both N- and C-terminal to any given amino acid, as well as the positional effect of an amino acid in a peptide and peptide length on such fragmentation, has been determined. Peptide bond cleavage propensities, both positive (i.e., enhanced) and negative (i.e., suppressed), were determined and ranked in order of their cleavage preferences as primary, secondary, or tertiary cleavage effects. For example, primary positive cleavage effects were observed for Xaa-Pro and Asp-Xaa bond cleavage for mobile and nonmobile peptide ion categories, respectively. We also report specific pairwise interactions (e.g., Asn-Gly) that result in enhanced amide bond cleavages analogous to those observed in solution-phase chemistry. Peptides classified as nonmobile gave low or insignificant scores, below reported MS/MS score thresholds (cutoff filters), indicating that incorporation of the relative proton mobility scale classification would lead to improvements in current MS/MS scoring functions.
|
Journal Article |
21 |
191 |
23
|
Tonkin CJ, Carret CK, Duraisingh MT, Voss TS, Ralph SA, Hommel M, Duffy MF, da Silva LM, Scherf A, Ivens A, Speed TP, Beeson JG, Cowman AF. Sir2 paralogues cooperate to regulate virulence genes and antigenic variation in Plasmodium falciparum. PLoS Biol 2009; 7:e84. [PMID: 19402747 PMCID: PMC2672602 DOI: 10.1371/journal.pbio.1000084] [Citation(s) in RCA: 184] [Impact Index Per Article: 11.5] [Received: 12/08/2008] [Accepted: 03/02/2009] [Indexed: 11/19/2022] Open
Abstract
Cytoadherence of Plasmodium falciparum-infected erythrocytes in the brain, organs and peripheral microvasculature is linked to morbidity and mortality associated with severe malaria. Parasite-derived P. falciparum Erythrocyte Membrane Protein 1 (PfEMP1) molecules displayed on the erythrocyte surface are responsible for cytoadherence and undergo antigenic variation in the course of an infection. Antigenic variation of PfEMP1 is achieved by in situ switching and mutually exclusive transcription of the var gene family, a process that is controlled by epigenetic mechanisms. Here we report characterisation of the P. falciparum silent information regulators A and B (PfSir2A and PfSir2B) and their involvement in mutual exclusion and silencing of the var gene repertoire. Analysis of P. falciparum parasites lacking either PfSir2A or PfSir2B shows that these NAD(+)-dependent histone deacetylases are required for silencing of different var gene subsets classified by their conserved promoter type. We also demonstrate that in the absence of either of these molecules mutually exclusive expression of var genes breaks down. We show that var gene silencing originates within the promoter and PfSir2 paralogues are involved in cis spreading of silenced chromatin into adjacent regions. Furthermore, parasites lacking PfSir2A but not PfSir2B have considerably longer telomeric repeats, demonstrating a role for this molecule in telomeric end protection. This work highlights the pivotal but distinct role for both PfSir2 paralogues in epigenetic silencing of P. falciparum virulence genes and the control of pathogenicity of malaria infection.
|
research-article |
16 |
184 |
24
|
Lin SJ, Gagnon-Bartsch JA, Tan IB, Earle S, Ruff L, Pettinger K, Ylstra B, van Grieken N, Rha SY, Chung HC, Lee JS, Cheong JH, Noh SH, Aoyama T, Miyagi Y, Tsuburaya A, Yoshikawa T, Ajani JA, Boussioutas A, Yeoh KG, Yong WP, So J, Lee J, Kang WK, Kim S, Kameda Y, Arai T, zur Hausen A, Speed TP, Grabsch HI, Tan P. Signatures of tumour immunity distinguish Asian and non-Asian gastric adenocarcinomas. Gut 2015; 64:1721-31. [PMID: 25385008 PMCID: PMC4680172 DOI: 10.1136/gutjnl-2014-308252] [Citation(s) in RCA: 179] [Impact Index Per Article: 17.9] [Received: 08/13/2014] [Accepted: 09/09/2014] [Indexed: 12/12/2022]
Abstract
OBJECTIVE: Differences in gastric cancer (GC) clinical outcomes between patients in Asian and non-Asian countries have been historically attributed to variability in clinical management. However, recent international Phase III trials suggest that even with standardised treatments, GC outcomes differ by geography. Here, we investigated gene expression differences between Asian and non-Asian GCs, and whether these molecular differences might influence clinical outcome. DESIGN: We compared gene expression profiles of 1016 GCs from six Asian and three non-Asian GC cohorts, using a two-stage meta-analysis design and a novel biostatistical method (RUV-4) to adjust for technical variation between cohorts. We further validated our findings by computerised immunohistochemical analysis on two independent tissue microarray (TMA) cohorts from Asian and non-Asian localities (n=665). RESULTS: Gene signatures differentially expressed between Asian and non-Asian GCs were related to immune function and inflammation. Non-Asian GCs were significantly enriched in signatures related to T-cell biology, including CTLA-4 signalling. Similarly, in the TMA cohorts, non-Asian GCs showed significantly higher expression of T-cell markers (CD3, CD45R0, CD8) and lower expression of the immunosuppressive T-regulatory cell marker FOXP3 compared to Asian GCs (p<0.05). Inflammatory cell markers CD66b and CD68 also exhibited significant cohort differences (p<0.05). Exploratory analyses revealed a significant relationship between tumour immunity factors, geographic locality-specific prognosis, and postchemotherapy outcomes. CONCLUSIONS: Analyses of >1600 GCs suggest that Asian and non-Asian GCs exhibit distinct tumour immunity signatures related to T-cell function. These differences may influence geographical differences in clinical outcome, and the design of future trials particularly in immuno-oncology.
|
research-article |
10 |
179 |
25
|
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 2006; 8:485-99. [PMID: 17189563 DOI: 10.1093/biostatistics/kxl042] [Citation(s) in RCA: 168] [Impact Index Per Article: 8.8] [Indexed: 11/14/2022] Open
Abstract
In most microarray technologies, a number of critical steps are required to convert raw intensity measurements into the data relied upon by data analysts, biologists, and clinicians. These data manipulations, referred to as preprocessing, can influence the quality of the ultimate measurements. In the last few years, the high-throughput measurement of gene expression has been the most popular application of microarray technology. For this application, various groups have demonstrated that the use of modern statistical methodology can substantially improve accuracy and precision of the gene expression measurements, relative to ad hoc procedures introduced by designers and manufacturers of the technology. Currently, other applications of microarrays are becoming more and more popular. In this paper, we describe a preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease. In particular, we describe a methodology useful for preprocessing Affymetrix single-nucleotide polymorphism chips and obtaining genotype calls with the preprocessed data. We demonstrate how our procedure improves existing approaches using data from 3 relatively large studies, including one in which large numbers of independent calls are available. The proposed methods are implemented in the package oligo available from Bioconductor.
|
Research Support, Non-U.S. Gov't |
19 |
168 |