51
Zacher B, Torkler P, Tresch A. Analysis of Affymetrix ChIP-chip data using starr and R/Bioconductor. Cold Spring Harb Protoc 2011; 2011:pdb.top110. [PMID: 21536772 DOI: 10.1101/pdb.top110] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
INTRODUCTION This article provides a flexible workflow for the analysis of chromatin immunoprecipitation (ChIP-chip) data that covers quality control, probe sequence remapping, data preprocessing/normalization, visualization, and high-level analyses such as peak finding. It emphasizes the peculiarities of single-color Affymetrix arrays, but it is flexible enough to be applied to other array platforms as well. The article is accompanied by extensive code implementing each of the analysis steps.
Affiliation(s)
- Benedikt Zacher
- Department of Biochemistry, Center for Integrated Protein Sciences and Munich Center for Advanced Photonics at the Gene Center, Ludwig-Maximilians-University Munich, Munich, Germany
52
Shepard PJ, Choi EA, Lu J, Flanagan LA, Hertel KJ, Shi Y. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA (New York, N.Y.) 2011; 17:761-72. [PMID: 21343387 PMCID: PMC3062186 DOI: 10.1261/rna.2581711] [Citation(s) in RCA: 326] [Impact Index Per Article: 25.1] [Received: 12/03/2010] [Accepted: 01/11/2011] [Indexed: 05/20/2023]
Abstract
Alternative polyadenylation (APA) of mRNAs has emerged as an important mechanism for post-transcriptional gene regulation in higher eukaryotes. Although microarrays have recently been used to characterize APA globally, they have a number of serious limitations that prevent comprehensive and highly quantitative analysis. To better characterize APA and its regulation, we have developed a deep sequencing-based method called Poly(A) Site Sequencing (PAS-Seq) for quantitatively profiling RNA polyadenylation at the transcriptome level. PAS-Seq not only accurately and comprehensively identifies poly(A) junctions in mRNAs and noncoding RNAs, but also provides quantitative information on the relative abundance of polyadenylated RNAs. PAS-Seq analyses of human and mouse transcriptomes showed that 40%-50% of all expressed genes produce alternatively polyadenylated mRNAs. Furthermore, our study detected evolutionarily conserved polyadenylation of histone mRNAs and revealed novel features of mitochondrial RNA polyadenylation. Finally, PAS-Seq analyses of mouse embryonic stem (ES) cells, neural stem/progenitor (NSP) cells, and neurons not only identified more poly(A) sites than were found in the entire mouse EST database, but also detected significant changes in the global APA profile that lead to lengthening of 3' untranslated regions (UTRs) in many mRNAs during stem cell differentiation. Together, our PAS-Seq analyses revealed a complex landscape of RNA polyadenylation in mammalian cells and the dynamic regulation of APA during stem cell differentiation.
Affiliation(s)
- Peter J Shepard
- Department of Microbiology and Molecular Genetics, University of California at Irvine, Irvine, California 92697, USA
53
Spudich GM, Fernández-Suárez XM. Disease and phenotype data at Ensembl. Curr Protoc Hum Genet 2011; Chapter 6:Unit 6.11. [PMID: 21400687 PMCID: PMC3099348 DOI: 10.1002/0471142905.hg0611s69] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/06/2022]
Abstract
Biological databases are an important resource for the life sciences community. Accessing the hundreds of databases supporting molecular biology and related fields is a daunting and time-consuming task. Integrating this information into one access point is a necessity for the life sciences community, which includes researchers focusing on human disease. Here we discuss the Ensembl genome browser, which acts as a single entry point with a graphical user interface to data from multiple projects, including OMIM, dbSNP, and the NHGRI GWAS catalog. Ensembl provides a comprehensive source of annotation for the human genome, along with other species of biomedical interest. In this unit, we explore how to use the Ensembl genome browser in example queries related to human genetic diseases. Support protocols demonstrate quick sequence export using the BioMart tool.
54
Otto C, Hoffmann S, Gorodkin J, Stadler PF. Fast local fragment chaining using sum-of-pair gap costs. Algorithms Mol Biol 2011; 6:4. [PMID: 21418573 PMCID: PMC3072320 DOI: 10.1186/1748-7188-6-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 10/29/2010] [Accepted: 03/18/2011] [Indexed: 01/11/2023] Open
Abstract
Background Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity, the number of short local match fragments easily becomes intractable. Fragment chaining is then a powerful means to quickly connect, score, and rank the fragments to improve the specificity. Results Here we present a fast and flexible fragment chainer that for the first time also supports a sum-of-pair gap cost model. This model has proven to achieve a higher accuracy and sensitivity in its own field of application. Due to a highly time-efficient index structure, our method outperforms the only existing tool for fragment chaining under the linear gap cost model. It can easily be applied to the output generated by alignment tools such as segemehl or BLAST. As an example we consider homology-based searches for human and mouse snoRNAs, demonstrating that a highly sensitive BLAST search with subsequent chaining is an attractive option. The sum-of-pair gap costs provide a substantial advantage in this context. Conclusions Chaining of short match fragments helps to quickly and accurately identify regions of homology that may not be found using local alignment heuristics alone. By providing both the linear and the sum-of-pair gap cost model, a wider range of applications can be covered. The software clasp is available at http://www.bioinf.uni-leipzig.de/Software/clasp/.
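To illustrate what local fragment chaining under a linear gap cost means, the following is a minimal quadratic-time sketch in Python. It is not the clasp algorithm (which the abstract says relies on a time-efficient index structure) and it does not implement the sum-of-pair model. The `Fragment` type, the `lam` gap weight, and the gap definition (number of unaligned positions between consecutive fragments in both sequences) are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """A local match: inclusive coordinates in query (q) and target (t)."""
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    score: float


def linear_gap(prev: Fragment, cur: Fragment, lam: float = 1.0) -> float:
    """Assumed linear gap cost: lam times the unaligned positions between fragments."""
    return lam * ((cur.q_start - prev.q_end - 1) + (cur.t_start - prev.t_end - 1))


def best_local_chain(fragments: List[Fragment], lam: float = 1.0) -> Tuple[float, List[Fragment]]:
    """Return the best-scoring colinear chain using a simple O(n^2) dynamic program."""
    if not fragments:
        return 0.0, []
    # Sorting by start guarantees that every valid predecessor comes earlier.
    frags = sorted(fragments, key=lambda f: (f.q_start, f.t_start))
    best = [f.score for f in frags]   # a chain may start at any fragment (local chaining)
    back = [-1] * len(frags)          # predecessor index, -1 marks the chain start
    for j, cur in enumerate(frags):
        for i in range(j):
            prev = frags[i]
            # prev must end strictly before cur starts in both sequences
            if prev.q_end < cur.q_start and prev.t_end < cur.t_start:
                cand = best[i] + cur.score - linear_gap(prev, cur, lam)
                if cand > best[j]:
                    best[j], back[j] = cand, i
    end = max(range(len(frags)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:                    # walk the predecessor links back to the start
        chain.append(frags[k])
        k = back[k]
    chain.reverse()
    return best[end], chain
```

With a large `lam`, extending a chain across a long gap stops paying off and the best chain shrinks to a single fragment, which is the ranking behaviour the abstract alludes to.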
55
Rödelsperger C, Krawitz P, Bauer S, Hecht J, Bigham AW, Bamshad M, de Condor BJ, Schweiger MR, Robinson PN. Identity-by-descent filtering of exome sequence data for disease-gene identification in autosomal recessive disorders. Bioinformatics 2011; 27:829-36. [PMID: 21278187 PMCID: PMC3051326 DOI: 10.1093/bioinformatics/btr022] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 10/09/2010] [Revised: 12/13/2010] [Accepted: 01/11/2011] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Next-generation sequencing and exome-capture technologies are currently revolutionizing the way geneticists screen for disease-causing mutations in rare Mendelian disorders. However, the identification of causal mutations is challenging due to the sheer number of variants that are identified in individual exomes. Although databases such as dbSNP or HapMap can be used to reduce the plethora of candidate genes by filtering out common variants, the remaining set of genes still remains on the order of dozens. RESULTS Our algorithm uses a non-homogeneous hidden Markov model that employs local recombination rates to identify chromosomal regions that are identical by descent (IBD = 2) in children of consanguineous or non-consanguineous parents solely based on genotype data of siblings derived from high-throughput sequencing platforms. Using simulated and real exome sequence data, we show that our algorithm is able to reduce the search space for the causative disease gene to a fifth or a tenth of the entire exome. AVAILABILITY An R script and an accompanying tutorial are available at http://compbio.charite.de/index.php/ibd2.html.
56
Pandey RV, Kofler R, Orozco-terWengel P, Nolte V, Schlötterer C. PoPoolation DB: a user-friendly web-based database for the retrieval of natural polymorphisms in Drosophila. BMC Genet 2011; 12:27. [PMID: 21366916 PMCID: PMC3060855 DOI: 10.1186/1471-2156-12-27] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/01/2010] [Accepted: 03/02/2011] [Indexed: 11/18/2022] Open
Abstract
Background The enormous potential of natural variation for the functional characterization of genes has been neglected for a long time. Only recently have functional geneticists started to account for natural variation in their analyses. With the new sequencing technologies it has become feasible to collect sequence information for multiple individuals on a genomic scale. In particular, sequencing pooled DNA samples has been shown to provide a cost-effective approach for characterizing variation in natural populations. While a range of software tools have been developed for mapping these reads onto a reference genome and extracting SNPs, linking this information to population genetic estimators and functional information still poses a major challenge to many researchers. Results We developed PoPoolation DB, a user-friendly integrated database. PoPoolation DB links variation in natural populations with functional information, allowing a wide range of researchers to take advantage of population genetic data. PoPoolation DB provides the user with population genetic parameters (Watterson's θ or Tajima's π), Tajima's D, SNPs, allele frequencies and indels in regions of interest. The database can be queried by gene name, chromosomal position, or a user-provided query sequence or GTF file. We anticipate that PoPoolation DB will be a highly versatile tool for functional geneticists as well as evolutionary biologists. Conclusions PoPoolation DB, available at http://www.popoolation.at/pgt, provides an integrated platform for researchers to investigate natural polymorphism and associated functional annotations from UCSC and Flybase genome browsers, population genetic estimators and RNA-seq information.
Affiliation(s)
- Ram Vinay Pandey
- Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, Vienna, Austria
57
PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One 2011; 6:e15925. [PMID: 21253599 PMCID: PMC3017084 DOI: 10.1371/journal.pone.0015925] [Citation(s) in RCA: 395] [Impact Index Per Article: 30.4] [Received: 09/07/2010] [Accepted: 11/30/2010] [Indexed: 11/19/2022] Open
Abstract
Recent statistical analyses suggest that sequencing of pooled samples provides a cost effective approach to determine genome-wide population genetic parameters. Here we introduce PoPoolation, a toolbox specifically designed for the population genetic analysis of sequence data from pooled individuals. PoPoolation calculates estimates of θWatterson, θπ, and Tajima's D that account for the bias introduced by pooling and sequencing errors, as well as divergence between species. Results of genome-wide analyses can be graphically displayed in a sliding window plot. PoPoolation is written in Perl and R and it builds on commonly used data formats. Its source code can be downloaded from http://code.google.com/p/popoolation/. Furthermore, we evaluate the influence of mapping algorithms, sequencing errors, and read coverage on the accuracy of population genetic parameter estimates from pooled data.
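For orientation, the three statistics named in this abstract can be sketched in their classical form for individually sequenced haplotypes. This is not PoPoolation's code (which is written in Perl and R) and it deliberately omits the corrections for pooling and sequencing error that the abstract describes; the function name `tajimas_d` and the toy input format (equal-length aligned strings) are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt
from typing import List, Optional, Tuple


def tajimas_d(seqs: List[str]) -> Tuple[float, float, Optional[float]]:
    """Return (Watterson's theta, pi, Tajima's D) for aligned haplotypes.

    Classical estimators for individual sequences; no pooling correction.
    D is None when there are no segregating sites.
    """
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    # S: number of segregating (polymorphic) alignment columns
    S = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = S / a1
    if S == 0:
        return theta_w, pi, None
    # Variance constants from Tajima (1989)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
    return theta_w, pi, d
```

Calling `tajimas_d(["AAAA", "AAAT", "AATT", "ATTT"])` returns per-locus values; dividing θ and π by the alignment length gives per-site estimates.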
58
Oshchepkov DY, Levitsky VG. In silico prediction of transcriptional factor-binding sites. Methods Mol Biol 2011; 760:251-67. [PMID: 21780002 DOI: 10.1007/978-1-61779-176-5_16] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/17/2023]
Abstract
The recognition of transcription factor binding sites (TFBSs) is the first step on the way to deciphering the DNA regulatory code. A large variety of computational approaches and corresponding in silico tools for TFBS recognition are available, each having their own advantages and shortcomings. This chapter provides a brief tutorial to assist end users in the application of these tools for functional characterization of genes.
Affiliation(s)
- Dmitry Y Oshchepkov
- Laboratory of Theoretical Genetics, Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia.
59
Calvo SE, Tucker EJ, Compton AG, Kirby DM, Crawford G, Burtt NP, Rivas M, Guiducci C, Bruno DL, Goldberger OA, Redman MC, Wiltshire E, Wilson CJ, Altshuler D, Gabriel SB, Daly MJ, Thorburn DR, Mootha VK. High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nat Genet 2010; 42:851-8. [PMID: 20818383 PMCID: PMC2977978 DOI: 10.1038/ng.659] [Citation(s) in RCA: 284] [Impact Index Per Article: 20.3] [Received: 03/02/2010] [Accepted: 08/11/2010] [Indexed: 12/15/2022]
Abstract
Discovering the molecular basis of mitochondrial respiratory chain disease is challenging given the large number of both mitochondrial and nuclear genes that are involved. We report a strategy of focused candidate gene prediction, high-throughput sequencing and experimental validation to uncover the molecular basis of mitochondrial complex I disorders. We created seven pools of DNA from a cohort of 103 cases and 42 healthy controls and then performed deep sequencing of 103 candidate genes to identify 151 rare variants that were predicted to affect protein function. We established genetic diagnoses in 13 of 60 previously unsolved cases using confirmatory experiments, including cDNA complementation to show that mutations in NUBPL and FOXRED1 can cause complex I deficiency. Our study illustrates how large-scale sequencing, coupled with functional prediction and experimental validation, can be used to identify causal mutations in individual cases.
Affiliation(s)
- Sarah E Calvo
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, USA
60
Wittig M, Helbig I, Schreiber S, Franke A. CNVineta: a data mining tool for large case-control copy number variation datasets. Bioinformatics 2010; 26:2208-9. [PMID: 20605930 PMCID: PMC2922892 DOI: 10.1093/bioinformatics/btq356] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/04/2010] [Revised: 06/05/2010] [Accepted: 06/29/2010] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Copy number variation (CNV), a major contributor to human genetic variation, comprises genomic deletions and insertions of ≥1 kb. Yet, the identification of CNVs from microarray data is still hampered by high false-negative and false-positive prediction rates due to the noisy nature of the raw data. Here, we present CNVineta, an R package for rapid data mining and visualization of CNVs in large case-control datasets genotyped with single nucleotide polymorphism oligonucleotide arrays. CNVineta is compatible with various established CNV prediction algorithms, can be used for genome-wide association analysis of rare and common CNVs and enables rapid and serial display of log2 of raw data ratios as well as B-allele frequencies for visual quality inspection. In summary, CNVineta aids in the interpretation of large-scale CNV datasets and prioritization of target regions for follow-up experiments. AVAILABILITY AND IMPLEMENTATION CNVineta is available as an R package and can be downloaded from http://www.ikmb.uni-kiel.de/CNVineta/; the package contains a tutorial outlining a typical workflow. The CNVineta compatible HapMap dataset can also be downloaded from the link above.
Affiliation(s)
- Michael Wittig
- Institute of Clinical Molecular Biology, Christian-Albrechts-University Kiel, Schittenhelmstrasse 12, 24105 Kiel, Germany.
61
Coulombe Y, Lemieux M, Moreau J, Aubin J, Joksimovic M, Bérubé-Simard FA, Tabariès S, Boucherat O, Guillou F, Larochelle C, Tuggle CK, Jeannotte L. Multiple promoters and alternative splicing: Hoxa5 transcriptional complexity in the mouse embryo. PLoS One 2010; 5:e10600. [PMID: 20485555 PMCID: PMC2868907 DOI: 10.1371/journal.pone.0010600] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 02/10/2010] [Accepted: 04/13/2010] [Indexed: 12/28/2022] Open
Abstract
Background The genomic organization of Hox clusters is fundamental for the precise spatio-temporal regulation and the function of each Hox gene, and hence for correct embryo patterning. Multiple overlapping transcriptional units exist at the Hoxa5 locus reflecting the complexity of Hox clustering: a major form of 1.8 kb corresponding to the two characterized exons of the gene and polyadenylated RNA species of 5.0, 9.5 and 11.0 kb. This transcriptional intricacy raises the question of the involvement of the larger transcripts in Hox function and regulation. Methodology/Principal Findings We have undertaken the molecular characterization of the Hoxa5 larger transcripts. They initiate from two highly conserved distal promoters, one corresponding to the putative Hoxa6 promoter, and a second located nearby Hoxa7. Alternative splicing is also involved in the generation of the different transcripts. No functional polyadenylation sequence was found at the Hoxa6 locus and all larger transcripts use the polyadenylation site of the Hoxa5 gene. Some larger transcripts are potential Hoxa6/Hoxa5 bicistronic units. However, even though all transcripts could produce the genuine 270 a.a. HOXA5 protein, only the 1.8 kb form is translated into the protein, indicative of its essential role in Hoxa5 gene function. The Hoxa6 mutation disrupts the larger transcripts without major phenotypic impact on axial specification in their expression domain. However, Hoxa5-like skeletal anomalies are observed in Hoxa6 mutants and these defects can be explained by the loss of expression of the 1.8 kb transcript. Our data raise the possibility that the larger transcripts may be involved in Hoxa5 gene regulation. 
Significance Our observation that the larger Hoxa5 transcripts show developmentally regulated expression, combined with the growing body of data on the role of long noncoding RNAs in transcriptional regulation, suggests that the larger Hoxa5 transcripts may participate in the control of Hox gene expression.
Affiliation(s)
- Yan Coulombe
- Centre de recherche en cancérologie de l'Université Laval, Centre Hospitalier Universitaire de Québec, L'Hôtel-Dieu de Québec, Québec, Québec, Canada
62
Abstract
Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
Affiliation(s)
- Mihaela Pertea
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Steven L Salzberg
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA