1
|
Affiliation(s)
- Tianxi Li
- Department of Statistics, University of Virginia, Charlottesville, VA
| | - Lihua Lei
- Department of Statistics, Stanford University, Stanford, CA
| | | | - Koen Van den Berge
- Department of Statistics, University of California, Berkeley, Berkeley, CA
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
| | - Purnamrita Sarkar
- Department of Statistics and Data Sciences, University of Texas at Austin, Austin, TX
| | - Peter J. Bickel
- Department of Statistics, University of California, Berkeley, Berkeley, CA
| | | |
|
2
|
Affiliation(s)
- Peter J Bickel
- Department of Statistics, University of California, Berkeley, CA 94720
| |
|
3
|
Lei L, Bickel PJ. An assumption-free exact test for fixed-design linear models with exchangeable errors. Biometrika 2020. [DOI: 10.1093/biomet/asaa079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
Abstract
Summary
We propose the cyclic permutation test to test general linear hypotheses for linear models. The test is nonrandomized and valid in finite samples with exact Type I error $\alpha$ for an arbitrary fixed design matrix and arbitrary exchangeable errors, whenever $1 / \alpha$ is an integer and $n / p \geqslant 1 / \alpha - 1$, where $n$ is the sample size and $p$ is the number of parameters. The test involves applying the marginal rank test to $1 / \alpha$ linear statistics of the outcome vector, where the coefficient vectors are determined by solving a linear system such that the joint distribution of the linear statistics is invariant with respect to a nonstandard cyclic permutation group under the null hypothesis. The power can be further enhanced by solving a secondary nonlinear travelling salesman problem, for which the genetic algorithm can find a reasonably good solution. Extensive simulation studies show that the cyclic permutation test has comparable power to existing tests. When testing for a single contrast of coefficients, an exact confidence interval can be obtained by inverting the test.
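The final decision step described in the summary, a marginal rank test over $1/\alpha$ statistics, can be sketched generically as follows. This is a minimal illustration only: it assumes the linear statistics have already been computed, and it does not construct the coefficient vectors or solve the linear system from the paper. The function names `marginal_rank_pvalue` and `rank_test_rejects` are hypothetical.

```python
def marginal_rank_pvalue(stats):
    """Rank-based p-value for stats[0] among all supplied statistics.

    stats[0] is taken to be the statistic of interest; the remaining
    entries are assumed (hypothetically) to be its counterparts under
    the invariance group. The p-value is the fraction of statistics
    that are at least as large as stats[0], counting stats[0] itself.
    """
    observed = stats[0]
    return sum(1 for s in stats if s >= observed) / len(stats)


def rank_test_rejects(stats, alpha):
    """Reject when the rank-based p-value does not exceed alpha."""
    return marginal_rank_pvalue(stats) <= alpha


# With 1/alpha = 5 statistics, rejection happens only when the first
# statistic is the strict maximum.
largest_first = [2.5, 0.1, -0.3, 1.2, 0.8]
middle_first = [0.8, 2.5, 0.1, -0.3, 1.2]
print(rank_test_rejects(largest_first, alpha=0.2))
print(rank_test_rejects(middle_first, alpha=0.2))
```

If the joint distribution of the statistics is invariant under the group and ties occur with probability zero, each statistic is equally likely to be the largest, which is why a level of exactly $\alpha$ requires $1/\alpha$ to be an integer.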
Affiliation(s)
- Lihua Lei
- Department of Statistics, Stanford University, 202 Sequoia Hall, 390 Serra Mall, Stanford, California 94305, U.S.A
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720, U.S.A
| |
|
4
|
Affiliation(s)
- Can M. Le
- Department of Statistics, University of California, Davis, Davis, CA
| | - Keith Levin
- Department of Statistics, University of Michigan, Ann Arbor, MI
| | - Peter J. Bickel
- Department of Statistics, University of California, Berkeley, Berkeley, CA
| | | |
|
5
|
Abstract
Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted, whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate that the domains identified have desirable epigenetic features and compare them across different cell types.
|
6
|
|
7
|
|
8
|
|
9
|
Stoiber MH, Olson S, May GE, Duff MO, Manent J, Obar R, Guruharsha KG, Bickel PJ, Artavanis-Tsakonas S, Brown JB, Graveley BR, Celniker SE. Extensive cross-regulation of post-transcriptional regulatory networks in Drosophila. Genome Res 2015; 25:1692-702. [PMID: 26294687 PMCID: PMC4617965 DOI: 10.1101/gr.182675.114] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 08/07/2014] [Accepted: 06/10/2015] [Indexed: 01/01/2023]
Abstract
In eukaryotic cells, RNAs exist as ribonucleoprotein particles (RNPs). Despite the importance of these complexes in many biological processes, including splicing, polyadenylation, stability, transportation, localization, and translation, their compositions are largely unknown. We affinity-purified 20 distinct RNA-binding proteins (RBPs) from cultured Drosophila melanogaster cells under native conditions and identified both the RNA and protein compositions of these RNP complexes. We identified “high occupancy target” (HOT) RNAs that interact with the majority of the RBPs we surveyed. HOT RNAs encode components of the nonsense-mediated decay and splicing machinery, as well as RNA-binding and translation initiation proteins. The RNP complexes contain proteins and mRNAs involved in RNA binding and post-transcriptional regulation. Genes with the capacity to produce hundreds of mRNA isoforms, ultracomplex genes, interact extensively with heterogeneous nuclear ribonuclear proteins (hnRNPs). Our data are consistent with a model in which subsets of RNPs include mRNA and protein products from the same gene, indicating the widespread existence of auto-regulatory RNPs. From the simultaneous acquisition and integrative analysis of protein and RNA constituents of RNPs, we identify extensive cross-regulatory and hierarchical interactions in post-transcriptional control.
Affiliation(s)
- Marcus H Stoiber
- Department of Biostatistics, University of California Berkeley, Berkeley, California 94720, USA; Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Sara Olson
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030, USA
| | - Gemma E May
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030, USA
| | - Michael O Duff
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030, USA
| | - Jan Manent
- Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Robert Obar
- Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - K G Guruharsha
- Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, USA; Biogen Incorporated, Cambridge, Massachusetts 02142, USA
| | - Peter J Bickel
- Department of Biostatistics, University of California Berkeley, Berkeley, California 94720, USA
| | - Spyros Artavanis-Tsakonas
- Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, USA; Biogen Incorporated, Cambridge, Massachusetts 02142, USA
| | - James B Brown
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA; Department of Statistics, University of California Berkeley, Berkeley, California 94720, USA
| | - Brenton R Graveley
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030, USA
| | - Susan E Celniker
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| |
|
10
|
|
11
|
|
12
|
Wang YXR, Jiang K, Feldman LJ, Bickel PJ, Huang H. Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis. Ann Appl Stat 2015. [DOI: 10.1214/14-aoas792] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
|
13
|
Chen ZX, Sturgill D, Qu J, Jiang H, Park S, Boley N, Suzuki AM, Fletcher AR, Plachetzki DC, FitzGerald PC, Artieri CG, Atallah J, Barmina O, Brown JB, Blankenburg KP, Clough E, Dasgupta A, Gubbala S, Han Y, Jayaseelan JC, Kalra D, Kim YA, Kovar CL, Lee SL, Li M, Malley JD, Malone JH, Mathew T, Mattiuzzo NR, Munidasa M, Muzny DM, Ongeri F, Perales L, Przytycka TM, Pu LL, Robinson G, Thornton RL, Saada N, Scherer SE, Smith HE, Vinson C, Warner CB, Worley KC, Wu YQ, Zou X, Cherbas P, Kellis M, Eisen MB, Piano F, Kionte K, Fitch DH, Sternberg PW, Cutter AD, Duff MO, Hoskins RA, Graveley BR, Gibbs RA, Bickel PJ, Kopp A, Carninci P, Celniker SE, Oliver B, Richards S. Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 2015; 24:1209-23. [PMID: 24985915 PMCID: PMC4079975 DOI: 10.1101/gr.159384.113] [Citation(s) in RCA: 111] [Impact Index Per Article: 12.3] [Indexed: 01/08/2023]
Abstract
Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.
Affiliation(s)
- Zhen-Xia Chen
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - David Sturgill
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Jiaxin Qu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Huaiyang Jiang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Soo Park
- Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Nathan Boley
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Ana Maria Suzuki
- Technology Development Group, RIKEN Omics Science Center and RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Yokohama City, Kanagawa, Japan 230-0045
| | - Anthony R Fletcher
- Division of Computational Bioscience, Center For Information Technology, National Institutes of Health, Bethesda, Maryland 20814, USA
| | - David C Plachetzki
- Department of Evolution and Ecology, University of California, Davis, California 95616, USA
| | - Peter C FitzGerald
- National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Carlo G Artieri
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Joel Atallah
- Department of Evolution and Ecology, University of California, Davis, California 95616, USA
| | - Olga Barmina
- Department of Evolution and Ecology, University of California, Davis, California 95616, USA
| | - James B Brown
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Kerstin P Blankenburg
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Emily Clough
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Abhijit Dasgupta
- Clinical Trials and Outcomes Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Sai Gubbala
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Yi Han
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Joy C Jayaseelan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Divya Kalra
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Yoo-Ah Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Christie L Kovar
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Sandra L Lee
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Mingmei Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - James D Malley
- Division of Computational Bioscience, Center For Information Technology, National Institutes of Health, Bethesda, Maryland 20814, USA
| | - John H Malone
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Tittu Mathew
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Nicolas R Mattiuzzo
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Mala Munidasa
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Fiona Ongeri
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Lora Perales
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Teresa M Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Ling-Ling Pu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Garrett Robinson
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Rebecca L Thornton
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Nehad Saada
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Steven E Scherer
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Harold E Smith
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Charles Vinson
- National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Crystal B Warner
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Kim C Worley
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Yuan-Qing Wu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Xiaoyan Zou
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Peter Cherbas
- Department of Biology, Indiana University, Bloomington, Indiana 47405, USA
| | - Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Michael B Eisen
- Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
| | - Fabio Piano
- Department of Biology, New York University, New York, New York 10003, USA
| | - Karin Kionte
- Department of Biology, New York University, New York, New York 10003, USA
| | - David H Fitch
- Department of Biology, New York University, New York, New York 10003, USA
| | - Paul W Sternberg
- HHMI and Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
| | - Asher D Cutter
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, M5S 3B2, Canada
| | - Michael O Duff
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
| | - Roger A Hoskins
- Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Brenton R Graveley
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
| | - Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Artyom Kopp
- Department of Evolution and Ecology, University of California, Davis, California 95616, USA
| | - Piero Carninci
- Technology Development Group, RIKEN Omics Science Center and RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Yokohama City, Kanagawa, Japan 230-0045
| | - Susan E Celniker
- Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Brian Oliver
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Stephen Richards
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| |
|
14
|
Li JJ, Huang H, Bickel PJ, Brenner SE. Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data. Genome Res 2015; 24:1086-101. [PMID: 24985912 PMCID: PMC4079965 DOI: 10.1101/gr.170100.113] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Indexed: 12/05/2022]
Abstract
We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data. We focus on “stage-associated genes” that capture specific transcriptional activities in each stage and use them to map pairwise stages within and between the two species by a hypergeometric test. Within each species, temporally adjacent stages exhibit high transcriptome similarity, as expected. Additionally, fly female adults and worm adults are mapped with fly and worm embryos, respectively, due to maternal gene expression. Between fly and worm, an unexpected strong collinearity is observed in the time course from early embryos to late larvae. Moreover, a second parallel pattern is found between fly prepupae through adults and worm late embryos through adults, consistent with the second large wave of cell proliferation and differentiation in the fly life cycle. The results indicate a partially duplicated developmental program in fly. Our results constitute the first comprehensive comparison between D. melanogaster and C. elegans developmental time courses and provide new insights into similarities in their development. We use an analogous approach to compare tissues and cells from fly and worm. Findings include strong transcriptome similarity of fly cell lines, clustering of fly adult tissues by origin regardless of sex and age, and clustering of worm tissues and dissected cells by developmental stage. Gene ontology analysis supports our results and gives a detailed functional annotation of different stages, tissues and cells. Finally, we show that standard correlation analyses could not effectively detect the mappings found by our method.
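The hypergeometric test mentioned in the abstract can be illustrated with a small sketch. It assumes that each stage is summarized by a set of stage-associated genes drawn from a common gene universe and that the test asks whether the overlap of two such sets is larger than expected by chance. The function names, gene identifiers, and universe size below are hypothetical, and the paper's actual gene-selection and ortholog-mapping steps are not reproduced.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    # math.comb returns 0 when k exceeds n, so impossible terms vanish.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


def stage_overlap_pvalue(genes_a, genes_b, n_genes):
    """P-value for the overlap of two gene sets within n_genes total genes."""
    overlap = len(genes_a & genes_b)
    return hypergeom_upper_tail(n_genes, len(genes_a), len(genes_b), overlap)


# Toy example with made-up gene identifiers and a universe of 10 genes.
stage_a = {"g1", "g2", "g3", "g4"}
stage_b = {"g3", "g4", "g5"}
print(stage_overlap_pvalue(stage_a, stage_b, n_genes=10))
```

A small p-value for a pair of stages would suggest that their stage-associated genes overlap more than chance alone explains; applying this to every pair of stages yields a matrix of p-values that can be inspected for mapping patterns.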
Affiliation(s)
- Jingyi Jessica Li
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Haiyan Huang
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
| |
|
15
|
|
16
|
Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, Pei B, Harmanci AO, Duff MO, Djebali S, Alexander RP, Alver BH, Auerbach R, Bell K, Bickel PJ, Boeck ME, Boley NP, Booth BW, Cherbas L, Cherbas P, Di C, Dobin A, Drenkow J, Ewing B, Fang G, Fastuca M, Feingold EA, Frankish A, Gao G, Good PJ, Guigó R, Hammonds A, Harrow J, Hoskins RA, Howald C, Hu L, Huang H, Hubbard TJP, Huynh C, Jha S, Kasper D, Kato M, Kaufman TC, Kitchen RR, Ladewig E, Lagarde J, Lai E, Leng J, Lu Z, MacCoss M, May G, McWhirter R, Merrihew G, Miller DM, Mortazavi A, Murad R, Oliver B, Olson S, Park PJ, Pazin MJ, Perrimon N, Pervouchine D, Reinke V, Reymond A, Robinson G, Samsonova A, Saunders GI, Schlesinger F, Sethi A, Slack FJ, Spencer WC, Stoiber MH, Strasbourger P, Tanzer A, Thompson OA, Wan KH, Wang G, Wang H, Watkins KL, Wen J, Wen K, Xue C, Yang L, Yip K, Zaleski C, Zhang Y, Zheng H, Brenner SE, Graveley BR, Celniker SE, Gingeras TR, Waterston R. Comparative analysis of the transcriptome across distant species. Nature 2014; 512:445-8. [PMID: 25164755 PMCID: PMC4155737 DOI: 10.1038/nature13424] [Citation(s) in RCA: 239] [Impact Index Per Article: 23.9] [Received: 04/10/2013] [Accepted: 04/30/2014] [Indexed: 12/30/2022]
Abstract
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
Affiliation(s)
- Mark B Gerstein
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [3] Department of Computer Science, Yale University, 51 Prospect Street, New Haven, Connecticut 06511, USA
| | - Joel Rozowsky
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Koon-Kiu Yan
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Daifeng Wang
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Chao Cheng
- [1] Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire 03755, USA [2] Institute for Quantitative Biomedical Sciences, Norris Cotton Cancer Center, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire 03766, USA
| | - James B Brown
- [1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2] Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA
| | - Carrie A Davis
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - LaDeana Hillier
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Cristina Sisu
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Jingyi Jessica Li
- [1] Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA [2] Department of Statistics, University of California, Los Angeles, California 90095-1554, USA [3] Department of Human Genetics, University of California, Los Angeles, California 90095-7088, USA
| | - Baikang Pei
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Arif O Harmanci
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Michael O Duff
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA
| | - Sarah Djebali
- [1] Centre for Genomic Regulation, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain [2] Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
| | - Roger P Alexander
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Burak H Alver
- Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, Massachusetts 02115, USA
| | - Raymond Auerbach
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Kimberly Bell
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA
| | - Max E Boeck
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Nathan P Boley
- [1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2] Department of Biostatistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA
| | - Benjamin W Booth
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Lucy Cherbas
- [1] Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA [2] Center for Genomics and Bioinformatics, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA
| | - Peter Cherbas
- [1] Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA [2] Center for Genomics and Bioinformatics, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA
| | - Chao Di
- MOE Key Lab of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Alex Dobin
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Jorg Drenkow
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Brent Ewing
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Gang Fang
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Megan Fastuca
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Elise A Feingold
- National Human Genome Research Institute, National Institutes of Health, 5635 Fishers Lane, Bethesda, Maryland 20892-9307, USA
| | - Adam Frankish
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Guanjun Gao
- MOE Key Lab of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Peter J Good
- National Human Genome Research Institute, National Institutes of Health, 5635 Fishers Lane, Bethesda, Maryland 20892-9307, USA
| | - Roderic Guigó
- [1] Centre for Genomic Regulation, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain [2] Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
| | - Ann Hammonds
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Jen Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Roger A Hoskins
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Cédric Howald
- [1] Center for Integrative Genomics, University of Lausanne, Genopode building, Lausanne 1015, Switzerland [2] Swiss Institute of Bioinformatics, Genopode building, Lausanne 1015, Switzerland
| | - Long Hu
- MOE Key Lab of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Haiyan Huang
- Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA
| | - Tim J P Hubbard
- [1] Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK [2] Medical and Molecular Genetics, King's College London, London WC2R 2LS, UK
| | - Chau Huynh
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Sonali Jha
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Dionna Kasper
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
| | - Masaomi Kato
- Department of Molecular, Cellular and Developmental Biology, PO Box 208103, Yale University, New Haven, Connecticut 06520, USA
| | - Thomas C Kaufman
- Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA
| | - Robert R Kitchen
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Erik Ladewig
- Sloan-Kettering Institute, 1275 York Avenue, Box 252, New York, New York 10065, USA
| | - Julien Lagarde
- [1] Centre for Genomic Regulation, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain [2] Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
| | - Eric Lai
- Sloan-Kettering Institute, 1275 York Avenue, Box 252, New York, New York 10065, USA
| | - Jing Leng
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Zhi Lu
- MOE Key Lab of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Michael MacCoss
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Gemma May
- [1] Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA [2] Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
| | - Rebecca McWhirter
- Department of Cell and Developmental Biology, Vanderbilt University, 465 21st Avenue South, Nashville, Tennessee 37232-8240, USA
| | - Gennifer Merrihew
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - David M Miller
- Department of Cell and Developmental Biology, Vanderbilt University, 465 21st Avenue South, Nashville, Tennessee 37232-8240, USA
| | - Ali Mortazavi
- [1] Developmental and Cell Biology, University of California, Irvine, California 92697, USA [2] Center for Complex Biological Systems, University of California, Irvine, California 92697, USA
| | - Rabi Murad
- [1] Developmental and Cell Biology, University of California, Irvine, California 92697, USA [2] Center for Complex Biological Systems, University of California, Irvine, California 92697, USA
| | - Brian Oliver
- Section of Developmental Genomics, Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Sara Olson
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA
| | - Peter J Park
- Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, Massachusetts 02115, USA
| | - Michael J Pazin
- National Human Genome Research Institute, National Institutes of Health, 5635 Fishers Lane, Bethesda, Maryland 20892-9307, USA
| | - Norbert Perrimon
- [1] Department of Genetics and Drosophila RNAi Screening Center, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA [2] Howard Hughes Medical Institute, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA
| | - Dmitri Pervouchine
- [1] Centre for Genomic Regulation, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain [2] Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
| | - Valerie Reinke
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
| | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, Genopode building, Lausanne 1015, Switzerland
| | - Garrett Robinson
- Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA
| | - Anastasia Samsonova
- [1] Department of Genetics and Drosophila RNAi Screening Center, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA [2] Howard Hughes Medical Institute, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA
| | - Gary I Saunders
- [1] Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK [2] European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
| | - Felix Schlesinger
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Anurag Sethi
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Frank J Slack
- Department of Molecular, Cellular and Developmental Biology, PO Box 208103, Yale University, New Haven, Connecticut 06520, USA
| | - William C Spencer
- Department of Cell and Developmental Biology, Vanderbilt University, 465 21st Avenue South, Nashville, Tennessee 37232-8240, USA
| | - Marcus H Stoiber
- [1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2] Department of Biostatistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA
| | - Pnina Strasbourger
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Andrea Tanzer
- [1] Bioinformatics and Genomics Programme, Center for Genomic Regulation, Universitat Pompeu Fabra (CRG-UPF), 08003 Barcelona, Catalonia, Spain [2] Institute for Theoretical Chemistry, Theoretical Biochemistry Group (TBI), University of Vienna, Währingerstrasse 17/3/303, A-1090 Vienna, Austria
| | - Owen A Thompson
- Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA
| | - Kenneth H Wan
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Guilin Wang
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
| | - Huaien Wang
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Kathie L Watkins
- Department of Cell and Developmental Biology, Vanderbilt University, 465 21st Avenue South, Nashville, Tennessee 37232-8240, USA
| | - Jiayu Wen
- Sloan-Kettering Institute, 1275 York Avenue, Box 252, New York, New York 10065, USA
| | - Kejia Wen
- MOE Key Lab of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Chenghai Xue
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Li Yang
- [1] Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA [2] Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Kevin Yip
- [1] Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong [2] CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
| | - Chris Zaleski
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - Yan Zhang
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Henry Zheng
- [1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA
| | - Steven E Brenner
- [1] Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA [2] Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA [3]
| | - Brenton R Graveley
- [1] Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA [2]
| | - Susan E Celniker
- [1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2]
| | - Thomas R Gingeras
- [1] Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA [2]
| | - Robert Waterston
- [1] Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA [2]
| |
|
17
|
Brown JB, Boley N, Eisman R, May GE, Stoiber MH, Duff MO, Booth BW, Wen J, Park S, Suzuki AM, Wan KH, Yu C, Zhang D, Carlson JW, Cherbas L, Eads BD, Miller D, Mockaitis K, Roberts J, Davis CA, Frise E, Hammonds AS, Olson S, Shenker S, Sturgill D, Samsonova AA, Weiszmann R, Robinson G, Hernandez J, Andrews J, Bickel PJ, Carninci P, Cherbas P, Gingeras TR, Hoskins RA, Kaufman TC, Lai EC, Oliver B, Perrimon N, Graveley BR, Celniker SE. Diversity and dynamics of the Drosophila transcriptome. Nature 2014; 512:393-9. [PMID: 24670639 PMCID: PMC4152413 DOI: 10.1038/nature12962] [Citation(s) in RCA: 470] [Impact Index Per Article: 47.0] [Received: 04/20/2013] [Accepted: 12/18/2013] [Indexed: 01/10/2023]
Abstract
Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.
|
18
|
Boley N, Stoiber MH, Booth BW, Wan KH, Hoskins RA, Bickel PJ, Celniker SE, Brown JB. Genome-guided transcript assembly by integrative analysis of RNA sequence data. Nat Biotechnol 2014; 32:341-6. [PMID: 24633242 PMCID: PMC4037530 DOI: 10.1038/nbt.2850] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 12/03/2013] [Accepted: 02/11/2014] [Indexed: 01/31/2023]
Abstract
The identification of full-length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in the annotation of genomes. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call Generalized RNA Integration Tool, or GRIT. Applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recovered the vast majority of previously annotated transcripts and doubled the total number of transcripts cataloged. We found that 20% of protein-coding genes encode multiple protein-localization signals and that, in 20-d-old adult fly heads, genes with multiple polyadenylation sites are more common than genes with alternative splicing or alternative promoters. GRIT demonstrates 30% higher precision and recall than the most widely used transcript assembly tools. GRIT will facilitate the automated generation of high-quality genome annotations without the need for extensive manual annotation.
Affiliation(s)
- Nathan Boley
- Department of Biostatistics, University of California at Berkeley, Berkeley, CA, USA
| | - Marcus H. Stoiber
- Department of Biostatistics, University of California at Berkeley, Berkeley, CA, USA
| | - Benjamin W. Booth
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
| | - Kenneth H. Wan
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
| | - Roger A. Hoskins
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
| | - Peter J. Bickel
- Department of Statistics, University of California at Berkeley, Berkeley, CA, USA
| | - Susan E. Celniker
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
| | - James B. Brown
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Department of Statistics, University of California at Berkeley, Berkeley, CA, USA
| |
|
19
|
Abstract
modENCODE was a 5-year NHGRI-funded project (2007-2012) to map the function of every base in the genomes of worms and flies, characterizing positions of modified histones and other chromatin marks, origins of DNA replication, RNA transcripts and the transcription factor binding sites that control gene expression. Here we describe the Drosophila modENCODE datasets and how best to access and use them for genome-wide and individual gene studies.
Affiliation(s)
- Nathan Boley
- Department of Biostatistics, University of California Berkeley, Berkeley, CA, United States
| | - Kenneth H Wan
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
| | - Peter J Bickel
- Department of Statistics, University of California Berkeley, Berkeley, CA, United States
| | - Susan E Celniker
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
| |
|
20
|
Li JJ, Bickel PJ, Biggin MD. System wide analyses have underestimated protein abundances and the importance of transcription in mammals. PeerJ 2014; 2:e270. [PMID: 24688849 PMCID: PMC3940484 DOI: 10.7717/peerj.270] [Citation(s) in RCA: 207] [Impact Index Per Article: 20.7] [Received: 09/11/2013] [Accepted: 01/22/2014] [Indexed: 12/17/2022] Open
Abstract
Large-scale surveys in mammalian tissue culture cells suggest that the protein expressed at the median abundance is present at 8,000–16,000 molecules per cell and that differences in mRNA expression between genes explain only 10–40% of the differences in protein levels. We find, however, that these surveys have significantly underestimated protein abundances and the relative importance of transcription. Using individual measurements for 61 housekeeping proteins to rescale whole proteome data from Schwanhäusser et al. (2011), we find that the median protein detected is expressed at 170,000 molecules per cell and that our corrected protein abundance estimates show a higher correlation with mRNA abundances than do the uncorrected protein data. In addition, we estimated the impact of further errors in mRNA and protein abundances using direct experimental measurements of these errors. The resulting analysis suggests that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwanhäusser et al. (2011), though because one major source of error could not be estimated the true percent contribution should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al. (2011), and that the measured and inferred translation rates correlate poorly (R² = 0.13). Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. We also determined the percent contributions of transcription, RNA degradation, translation and protein degradation to the variance in protein abundances using both of our strategies. While the magnitudes of the two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.
Finally, the above estimates only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.
Affiliation(s)
- Jingyi Jessica Li
- Department of Statistics, University of California, Berkeley, CA, USA; Departments of Statistics and Human Genetics, University of California, Los Angeles, CA, USA
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, CA, USA
| | - Mark D Biggin
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| |
|
22
|
Bickel PJ, Cai M. Discussion of Sara van de Geer: Generic chaining and the L1 penalty. J Stat Plan Inference 2013. [DOI: 10.1016/j.jspi.2012.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
23
|
Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci U S A 2011; 108:19867-72. [PMID: 22135461 PMCID: PMC3250192 DOI: 10.1073/pnas.1113972108] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.8] [Indexed: 11/18/2022] Open
Abstract
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility in incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
Affiliation(s)
- Jingyi Jessica Li
- Department of Statistics, University of California, Berkeley, CA 94720
| | - Ci-Ren Jiang
- Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709-4006
| | - James B. Brown
- Department of Statistics, University of California, Berkeley, CA 94720
| | - Haiyan Huang
- Department of Statistics, University of California, Berkeley, CA 94720
| | - Peter J. Bickel
- Department of Statistics, University of California, Berkeley, CA 94720
| |
|
24
|
Polishchuk MS, Brown JB, Favorov AV, Bickel PJ, Tumanian VG. [Splice sites are overrepresented in Pasilla binding motif clusters in D. melanogaster genes]. Biofizika 2011; 56:1065-1070. [PMID: 22279750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
For the RNA-binding protein Pasilla, which has been shown to play a role in alternative splicing regulation, binding sites and clusters of binding sites are found in silico in the whole genome of D. melanogaster. The current study analyzes the occurrence of splice sites in binding site clusters. Several hundred thousand binding site motifs and thousands of significant motif clusters were identified. It was discovered that exon-intron borders in D. melanogaster genes are reliably found within Pasilla binding motif clusters, with a higher frequency than would otherwise be expected under a random model. Additionally, donor splice sites are found in Pasilla clusters twice as often as acceptor sites. This phenomenon is observed both for exons annotated as alternatively spliced and for exons annotated as constitutive. These observations support the hypothesis that Pasilla plays a functional role in splicing regulation of D. melanogaster.
|
27
|
Bickel PJ, Gel YR. Banded regularization of autocovariance matrices in application to parameter estimation and forecasting of time series. J R Stat Soc Series B Stat Methodol 2011. [DOI: 10.1111/j.1467-9868.2011.00779.x] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Indexed: 12/01/2022]
|
28
|
Hoskins RA, Landolin JM, Brown JB, Sandler JE, Takahashi H, Lassmann T, Yu C, Booth BW, Zhang D, Wan KH, Yang L, Boley N, Andrews J, Kaufman TC, Graveley BR, Bickel PJ, Carninci P, Carlson JW, Celniker SE. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res 2010; 21:182-92. [PMID: 21177961 DOI: 10.1101/gr.112466.110] [Citation(s) in RCA: 167] [Impact Index Per Article: 11.9] [Indexed: 11/24/2022]
Abstract
Core promoters are critical regions for gene regulation in higher eukaryotes. However, the boundaries of promoter regions, the relative rates of initiation at the transcription start sites (TSSs) distributed within them, and the functional significance of promoter architecture remain poorly understood. We produced a high-resolution map of promoters active in the Drosophila melanogaster embryo by integrating data from three independent and complementary methods: 21 million cap analysis of gene expression (CAGE) tags, 1.2 million RNA ligase mediated rapid amplification of cDNA ends (RLM-RACE) reads, and 50,000 cap-trapped expressed sequence tags (ESTs). We defined 12,454 promoters of 8,037 genes. Our analysis indicates that, due to non-promoter-associated RNA background signal, previous studies have likely overestimated the number of promoter-associated CAGE clusters by fivefold. We show that TSS distributions form a complex continuum of shapes, and that promoters active in the embryo and adult have highly similar shapes in 95% of cases. This suggests that these distributions are generally determined by static elements such as local DNA sequence and are not modulated by dynamic signals such as histone modifications. Transcription factor binding motifs are differentially enriched as a function of promoter shape, and peaked promoter shape is correlated with both temporal and spatial regulation of gene expression. Our results contribute to the emerging view that core promoters are functionally diverse and control patterning of gene expression in Drosophila and mammals.
Affiliation(s)
- Roger A Hoskins
- Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
|
29
|
Bickel PJ. Leo Breiman: An important intellectual and personal force in statistics, my life and that of many others. Ann Appl Stat 2010. [DOI: 10.1214/10-aoas404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
31
|
Abstract
Examination of aggregate data on graduate admissions to the University of California, Berkeley, for fall 1973 shows a clear but misleading pattern of bias against female applicants. Examination of the disaggregated data reveals few decision-making units that show statistically significant departures from expected frequencies of female admissions, and about as many units appear to favor women as to favor men. If the data are properly pooled, taking into account the autonomy of departmental decision making, thus correcting for the tendency of women to apply to graduate departments that are more difficult for applicants of either sex to enter, there is a small but statistically significant bias in favor of women. The graduate departments that are easier to enter tend to be those that require more mathematics in the undergraduate preparatory curriculum. The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.
|
33
|
Bickel PJ, Brown JB, Huang H, Li Q. An overview of recent developments in genomics and associated statistical methods. Philos Trans A Math Phys Eng Sci 2009; 367:4313-37. [PMID: 19805447 DOI: 10.1098/rsta.2009.0164] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 05/21/2023]
Abstract
The landscape of genomics has changed drastically in the last two decades. Increasingly inexpensive sequencing has shifted the primary focus from the acquisition of biological sequences to the study of biological function. Assays have been developed to study many intricacies of biological systems, and publicly available databases have given rise to integrative analyses that combine information from many sources to draw complex conclusions. Such research was the focus of the recent workshop at the Isaac Newton Institute, 'High dimensional statistics in biology'. Many computational methods from modern genomics and related disciplines were presented and discussed. Using, as much as possible, the material from these talks, we give an overview of modern genomics: from the essential assays that make data-generation possible, to the statistical methods that yield meaningful inference. We point to current analytical challenges, where novel methods, or novel applications of extant methods, are presently needed.
Affiliation(s)
- Peter J Bickel
- Department of Statistics, University of California, Berkeley, CA, USA
| | | | | | | |
|
36
|
Bickel PJ, Ritov Y. Discussion of: Treelets—An adaptive multi-scale basis for sparse unordered data. Ann Appl Stat 2008. [DOI: 10.1214/08-aoas137b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
43
|
Kechris KJ, Lin JC, Bickel PJ, Glazer AN. Quantitative exploration of the occurrence of lateral gene transfer by using nitrogen fixation genes as a case study. Proc Natl Acad Sci U S A 2006; 103:9584-9. [PMID: 16769896 PMCID: PMC1480450 DOI: 10.1073/pnas.0603534103] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Indexed: 11/18/2022] Open
Abstract
Lateral gene transfer (LGT) is now accepted as an important factor in the evolution of prokaryotes. Establishment of the occurrence of LGT is typically attempted by a variety of methods that includes the comparison of reconstructed phylogenetic trees, the search for unusual GC composition or codon usage within a genome, and identification of similarities between distant species as determined by best BLAST hits. We explore quantitative assessments of these strategies to study the prokaryotic trait of nitrogen fixation, the enzyme-catalyzed reduction of N₂ to ammonia. Phylogenies constructed on nitrogen fixation genes are not in agreement with the tree-of-life based on 16S rRNA but do not conclusively distinguish between gene loss and LGT hypotheses. Using a series of analyses on a set of complete genomes, our results distinguish two structurally distinct classes of MoFe nitrogenases whose distribution cuts across lines of vertical inheritance and makes us believe that a conclusive case for LGT has been made.
Affiliation(s)
- Katherina J. Kechris
- Department of Biochemistry and Biophysics, University of California, 600 16th Street, Box 2240, San Francisco, CA 94143
- Present address, to which correspondence may be sent: Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, 4200 East 9th Avenue, B-119, Denver, CO 80262
| | - Jason C. Lin
- Department of Statistics, University of California, 367 Evans Hall #3860, Berkeley, CA 94720
| | - Peter J. Bickel
- Department of Statistics, University of California, 367 Evans Hall #3860, Berkeley, CA 94720
- To whom correspondence may be addressed
| | - Alexander N. Glazer
- Department of Molecular and Cell Biology, University of California, 142 LSA #3200, Berkeley, CA 94720
- To whom correspondence may be addressed
| |
|
44
|
van Zwet EW, Kechris KJ, Bickel PJ, Eisen MB. Estimating motifs under order restrictions. Stat Appl Genet Mol Biol 2006; 4:Article1. [PMID: 16646826 DOI: 10.2202/1544-6115.1100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Transcription factors and many other DNA-binding proteins recognize more than one specific sequence. Among sequences recognized by a given DNA-binding protein, different positions exhibit varying degrees of conservation. The reason is that base pairs that are more extensively contacted by the protein tend to be more conserved. This observation can be used in the discovery of transcription factor binding sites. Here we present a rigorous means to accomplish this. In particular, we constrain the order of the information (entropy) in the columns of the position specific weight matrix (PWM) which characterizes the motif being sought. We then show how to compute the maximum likelihood estimate of a PWM under such order restrictions. This computation is easily integrated with the EM algorithm or the Gibbs sampler to enhance performance in the search for motifs in unaligned sequences. We demonstrate our method on a well-known data set of binding sites of the transcription factor Crp in E. coli.
|
46
|
Bickel PJ, Levina E. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 2004. [DOI: 10.3150/bj/1106314847] [Citation(s) in RCA: 332] [Impact Index Per Article: 16.6] [Indexed: 11/19/2022]
|
47
|
Ge Z, Bickel PJ, Rice JA. An approximate likelihood approach to nonlinear mixed effects models via spline approximation. Comput Stat Data Anal 2004. [DOI: 10.1016/j.csda.2003.10.011] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
|
48
|
Kechris KJ, van Zwet E, Bickel PJ, Eisen MB. Detecting DNA regulatory motifs by incorporating positional trends in information content. Genome Biol 2004; 5:R50. [PMID: 15239835 PMCID: PMC463320 DOI: 10.1186/gb-2004-5-7-r50] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 01/23/2004] [Revised: 05/04/2004] [Accepted: 05/04/2004] [Indexed: 11/10/2022] Open
Abstract
On the basis of the observation that conserved positions in transcription factor binding sites are often clustered together, we propose a simple extension to the model-based motif discovery methods. We assign position-specific prior distributions to the frequency parameters of the model, penalizing deviations from a specified conservation profile. Examples with both simulated and real data show that this extension helps discover motifs as the data become noisier or when there is a competing false motif.
Affiliation(s)
- Katherina J Kechris
- Department of Statistics, University of California, Berkeley, CA 94720, USA
- Current address: Department of Biochemistry and Biophysics, 600 16th Street 2240, University of California, San Francisco, CA 94143, USA
| | - Erik van Zwet
- Department of Statistics, University of California, Berkeley, CA 94720, USA
- Current address: Mathematical Institute, University Leiden, 2300 RA Leiden, The Netherlands
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, CA 94720, USA
| | - Michael B Eisen
- Department of Genome Sciences, Life Sciences Division, Ernest Orlando Lawrence Berkeley National Lab, Cyclotron Road, Berkeley, CA 94720, USA
- Center for Integrative Genomics, Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
| |
|
49
|
Bartlett PL, Bickel PJ, Bühlmann P, Freund Y, Friedman J, Hastie T, Jiang W, Jordan MI, Koltchinskii V, Lugosi G, McAuliffe JD, Ritov Y, Rosset S, Schapire RE, Tibshirani R, Vayatis N, Yu B, Zhang T, Zhu J. Discussions of boosting papers, and rejoinders. Ann Stat 2004. [DOI: 10.1214/aos/1105988581] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
|
50
|
|