301
Bianco L, Cestaro A, Sargent DJ, Banchi E, Derdak S, Di Guardo M, Salvi S, Jansen J, Viola R, Gut I, Laurens F, Chagné D, Velasco R, van de Weg E, Troggio M. Development and validation of a 20K single nucleotide polymorphism (SNP) whole genome genotyping array for apple (Malus × domestica Borkh). PLoS One 2014; 9:e110377. [PMID: 25303088; PMCID: PMC4193858; DOI: 10.1371/journal.pone.0110377] [Citation(s) in RCA: 108] [Impact Index Per Article: 10.8] [Received: 05/09/2014] [Accepted: 09/12/2014] [Indexed: 01/08/2023] [Open access]
Abstract
High-density SNP arrays for genome-wide assessment of allelic variation have made high resolution genetic characterization of crop germplasm feasible. A medium density array for apple, the IRSC 8K SNP array, has been successfully developed and used for screens of bi-parental populations. However, the number of robust and well-distributed markers contained on this array was not sufficient to perform genome-wide association analyses in wider germplasm sets, or Pedigree-Based Analysis at high precision, because of rapid decay of linkage disequilibrium. We describe the development of an Illumina Infinium array targeting 20K SNPs. The SNPs were predicted from re-sequencing data derived from the genomes of 13 Malus × domestica apple cultivars and one accession belonging to a crab apple species (M. micromalus). A pipeline for SNP selection was devised that avoided the pitfalls associated with the inclusion of paralogous sequence variants, supported the construction of robust multi-allelic SNP haploblocks and selected up to 11 entries within narrow genomic regions of ±5 kb, termed focal points (FPs). Broad genome coverage was attained by placing FPs at 1 cM intervals on a consensus genetic map, complementing them with FPs to enrich the ends of each of the chromosomes, and by bridging physical intervals greater than 400 kbp. The selection also included ∼3.7K validated SNPs from the IRSC 8K array. The array has already been used in other studies where ∼15.8K SNP markers were mapped with an average of ∼6.8K SNPs per full-sib family. The newly developed array with its high density of polymorphic validated SNPs is expected to be of great utility for Pedigree-Based Analysis and Genomic Selection. It will also be a valuable tool to help dissect the genetic mechanisms controlling important fruit quality traits, and to aid the identification of marker-trait associations suitable for the application of Marker Assisted Selection in apple breeding programs.
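The focal-point rule in this abstract (up to 11 SNPs within ±5 kb of a focal point, and bridging of physical intervals above 400 kb) can be illustrated with a small Python sketch. This is not the authors' pipeline: the function names, the input format (plain lists of base-pair positions), and the closest-first ranking are assumptions made only for illustration.

```python
def select_focal_point_snps(snp_positions, focal_point, half_window=5_000, max_snps=11):
    """Return up to max_snps SNP positions within +/- half_window bp of focal_point.

    Ranking candidates by distance to the focal point is an assumption;
    the published pipeline applied its own selection criteria.
    """
    in_window = [p for p in snp_positions if abs(p - focal_point) <= half_window]
    # Closest candidates first; ties are broken by position for determinism.
    in_window.sort(key=lambda p: (abs(p - focal_point), p))
    return sorted(in_window[:max_snps])


def gaps_to_bridge(focal_points, max_gap=400_000):
    """Return (start, end) pairs of adjacent focal points more than max_gap bp apart."""
    ordered = sorted(focal_points)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
```

For example, `gaps_to_bridge([0, 300_000, 800_000])` flags only the second interval, because it spans 500 kb.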
Affiliation(s)
- Luca Bianco
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Alessandro Cestaro
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Daniel James Sargent
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Elisa Banchi
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Sophia Derdak
- CNAG – Centro Nacional de Análisis Genómico, Parc Científic de Barcelona, Barcelona, Spain
- Mario Di Guardo
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Wageningen UR Plant Breeding, Wageningen University and Research Centre, Wageningen, The Netherlands
- Johannes Jansen
- Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands
- Roberto Viola
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Ivo Gut
- CNAG – Centro Nacional de Análisis Genómico, Parc Científic de Barcelona, Barcelona, Spain
- Francois Laurens
- INRA, UMR1345 Institut de Recherche en Horticulture and Semences, Beaucouzé, France
- David Chagné
- Plant & Food Research, Palmerston North Research Centre, Palmerston North, New Zealand
- Riccardo Velasco
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Eric van de Weg
- Wageningen UR Plant Breeding, Wageningen University and Research Centre, Wageningen, The Netherlands
- Michela Troggio
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
302
DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Res 2014; 24:2022-32. [PMID: 25236618; PMCID: PMC4248318; DOI: 10.1101/gr.175141.114] [Citation(s) in RCA: 305] [Impact Index Per Article: 30.5] [Indexed: 12/11/2022]
Abstract
Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1× genome coverage. We improve on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.
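The "statistical limit imposed by read counting" mentioned in this abstract can be made concrete with a short Python sketch. It assumes reads fall into fixed-size bins independently, so that the count per bin is approximately Poisson. The bin size and read length used in the example call are illustrative assumptions, not values stated in the abstract, and the functions are not part of QDNAseq.

```python
import math


def expected_reads_per_bin(coverage, bin_size_bp, read_length_bp):
    """Mean number of reads per bin for a given genome coverage."""
    return coverage * bin_size_bp / read_length_bp


def poisson_relative_noise(mean_reads):
    """Relative standard deviation (sd / mean) of a Poisson count."""
    return 1.0 / math.sqrt(mean_reads)


# Assumed parameters: 0.1x coverage (from the abstract), 15 kb bins and
# 50 bp reads (hypothetical choices for illustration).
mean_reads = expected_reads_per_bin(0.1, 15_000, 50)
noise_floor = poisson_relative_noise(mean_reads)
```

Under these assumptions a bin holds about 30 reads on average, so no correction procedure can push the per-bin relative noise below roughly 1/sqrt(30).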
303
Meyer CA, Liu XS. Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat Rev Genet 2014; 15:709-21. [PMID: 25223782; DOI: 10.1038/nrg3788] [Citation(s) in RCA: 205] [Impact Index Per Article: 20.5] [Indexed: 02/07/2023]
Abstract
Next-generation sequencing (NGS) technologies have been used in diverse ways to investigate various aspects of chromatin biology by identifying genomic loci that are bound by transcription factors, occupied by nucleosomes or accessible to nuclease cleavage, or loci that physically interact with remote genomic loci. However, reaching sound biological conclusions from such NGS enrichment profiles requires many potential biases to be taken into account. In this Review, we discuss common ways in which biases may be introduced into NGS chromatin profiling data, approaches to diagnose these biases and analytical techniques to mitigate their effect.
Affiliation(s)
- Clifford A Meyer
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, Massachusetts 02115, USA; and Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- X Shirley Liu
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, Massachusetts 02115, USA; and Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
304
Supek F, Lehner B, Hajkova P, Warnecke T. Hydroxymethylated cytosines are associated with elevated C to G transversion rates. PLoS Genet 2014; 10:e1004585. [PMID: 25211471; PMCID: PMC4161303; DOI: 10.1371/journal.pgen.1004585] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 04/03/2014] [Accepted: 07/07/2014] [Indexed: 11/23/2022] [Open access]
Abstract
It has long been known that methylated cytosines deaminate at higher rates than unmodified cytosines and constitute mutational hotspots in mammalian genomes. The repertoire of naturally occurring cytosine modifications, however, extends beyond 5-methylcytosine to include its oxidation derivatives, notably 5-hydroxymethylcytosine. The effects of these modifications on sequence evolution are unknown. Here, we combine base-resolution maps of methyl- and hydroxymethylcytosine in human and mouse with population genomic, divergence and somatic mutation data to show that hydroxymethylated and methylated cytosines show distinct patterns of variation and evolution. Surprisingly, hydroxymethylated sites are consistently associated with elevated C to G transversion rates at the level of segregating polymorphisms, fixed substitutions, and somatic mutations in tumors. Controlling for multiple potential confounders, we find derived C to G SNPs to be 1.43-fold (1.22-fold) more common at hydroxymethylated sites compared to methylated sites in human (mouse). Increased C to G rates are evident across diverse functional and sequence contexts and, in cancer genomes, correlate with the expression of Tet enzymes and specific components of the mismatch repair pathway (MSH2, MSH6, and MBD4). Based on these and other observations we suggest that hydroxymethylation is associated with a distinct mutational burden and that the mismatch repair pathway is implicated in causing elevated transversion rates at hydroxymethylated cytosines.
Author Summary
Most cytosines that occur in a CpG context in mammalian genomes are methylated. Methylation has important functional consequences in the cell but also affects genome evolution. Notably, methylated cytosines are prone to deaminate and constitute mutational hotspots in mammalian genomes. Recently, a series of other modifications, derived from the oxidation of methylated cytosines, was shown to exist in various mammalian cell types including embryonic stem cells. The most abundant of these modifications is 5-hydroxymethylcytosine. In this work, we ask whether methylated and hydroxymethylated cytosines are subject to the same mutational biases or lead to distinct patterns of genome evolution. To do so, we examine differences between individuals, between species, and between normal and cancer tissues alongside high-resolution maps of DNA methylation and hydroxymethylation in the human and mouse genomes. Unexpectedly, we find that hydroxymethylated cytosines are associated with more cytosine to guanine changes in both human and mouse populations, in closely related species, and in the context of somatic evolution in tumors. Based on multiple lines of evidence, we suggest that the different patterns of sequence evolution at methylated and hydroxymethylated sites are owing to differences in how these sites are handled by the DNA repair machinery.
Affiliation(s)
- Fran Supek
- EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (CRG), Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
- Ben Lehner
- EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (CRG), Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats, Centre for Genomic Regulation (CRG) and UPF, Barcelona, Spain
- Petra Hajkova
- Reprogramming and Chromatin Group, MRC Clinical Sciences Centre, Imperial College, Hammersmith Campus, London, United Kingdom
- Tobias Warnecke
- Molecular Systems Group, MRC Clinical Sciences Centre, Imperial College, Hammersmith Campus, London, United Kingdom
305
Liu B, Morrison CD, Johnson CS, Trump DL, Qin M, Conroy JC, Wang J, Liu S. Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges. Oncotarget 2014; 4:1868-81. [PMID: 24240121; PMCID: PMC3875755; DOI: 10.18632/oncotarget.1537] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Indexed: 12/27/2022] [Open access]
Abstract
Accurate detection of somatic copy number variations (CNVs) is an essential part of cancer genome analysis, and plays an important role in oncotarget identifications. Next generation sequencing (NGS) holds the promise to revolutionize somatic CNV detection. In this review, we provide an overview of current analytic tools used for CNV detection in NGS-based cancer studies. We summarize the NGS data types used for CNV detection, decipher the principles for data preprocessing, segmentation, and interpretation, and discuss the challenges in somatic CNV detection. This review aims to provide a guide to the analytic tools used in NGS-based cancer CNV studies, and to discuss the important factors that researchers need to consider when analyzing NGS data for somatic CNV detections.
Affiliation(s)
- Biao Liu
- Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY
306
A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun 2014; 5:4812. [PMID: 25176035; PMCID: PMC4166679; DOI: 10.1038/ncomms5812] [Citation(s) in RCA: 437] [Impact Index Per Article: 43.7] [Received: 04/11/2014] [Accepted: 07/25/2014] [Indexed: 12/31/2022] [Open access]
Abstract
Strain-specific genomic diversity in the Mycobacterium tuberculosis complex (MTBC) is an important factor in pathogenesis that may affect virulence, transmissibility, host response and emergence of drug resistance. Several systems have been proposed to classify MTBC strains into distinct lineages and families. Here, we investigate single-nucleotide polymorphisms (SNPs) as robust (stable) markers of genetic variation for phylogenetic analysis. We identify ~92k SNPs across a global collection of 1,601 genomes. The SNP-based phylogeny is consistent with the gold-standard regions of difference (RD) classification system. Of the ~7k strain-specific SNPs identified, 62 markers are proposed to discriminate known circulating strains. This SNP-based barcode is the first to cover all main lineages, and classifies a greater number of sublineages than current alternatives. It may be used to classify clinical isolates to evaluate tools to control the disease, including therapeutics and vaccines whose effectiveness may vary by strain type.
Genetic variation in Mycobacterium tuberculosis complex (MTBC) bacteria is responsible for differences in factors such as virulence and transmissibility. Here, the authors analyse the genomes of 1,601 MTBC isolates from diverse geographic locations and identify 62 SNPs that may be used to resolve lineages and sublineages of these strains.
307
Lee H, McManus CJ, Cho DY, Eaton M, Renda F, Somma MP, Cherbas L, May G, Powell S, Zhang D, Zhan L, Resch A, Andrews J, Celniker SE, Cherbas P, Przytycka TM, Gatti M, Oliver B, Graveley B, MacAlpine D. DNA copy number evolution in Drosophila cell lines. Genome Biol 2014; 15:R70. [PMID: 25262759; PMCID: PMC4289277; DOI: 10.1186/gb-2014-15-8-r70] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.8] [Received: 03/28/2013] [Accepted: 07/01/2014] [Indexed: 12/19/2022] [Open access]
Abstract
BACKGROUND Structural rearrangements of the genome resulting in genic imbalance due to copy number change are often deleterious at the organismal level, but are common in immortalized cell lines and tumors, where they may be an advantage to cells. In order to explore the biological consequences of copy number changes in the Drosophila genome, we resequenced the genomes of 19 tissue-culture cell lines and generated RNA-Seq profiles. RESULTS Our work revealed dramatic duplications and deletions in all cell lines. We found three lines of evidence indicating that copy number changes were due to selection during tissue culture. First, we found that copy numbers correlated to maintain stoichiometric balance in protein complexes and biochemical pathways, consistent with the gene balance hypothesis. Second, while most copy number changes were cell line-specific, we identified some copy number changes shared by many of the independent cell lines. These included dramatic recurrence of increased copy number of the PDGF/VEGF receptor, which is also over-expressed in many cancer cells, and of bantam, an anti-apoptosis miRNA. Third, even when copy number changes seemed distinct between lines, there was strong evidence that they supported a common phenotypic outcome. For example, we found that proto-oncogenes were over-represented in one cell line (S2-DRSC), whereas tumor suppressor genes were under-represented in another (Kc167). CONCLUSION Our study illustrates how genome structure changes may contribute to selection of cell lines in vitro. This has implications for other cell-level natural selection progressions, including tumorigenesis.
Affiliation(s)
- Hangnoh Lee
- National Institute of Diabetes, Digestive, and Kidney Diseases, National Institutes of Health, 50 South Drive, Bethesda, MD 20892 USA
- C Joel McManus
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, CT 06030 USA
- Department of Biological Sciences, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
- Dong-Yeon Cho
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20892 USA
- Matthew Eaton
- Department of Pharmacology and Cancer Biology, Duke University Medical Center, Levine Science Research Center, 308 Research Drive, Durham, NC 27708 USA
- Fioranna Renda
- Istituto di Biologia e Patologia Molecolari (IBPM) del CNR and Dipartimento di Biologia e Biotecnologie, Sapienza, Università di Roma, 5 Aldo Moro Piazzale, Rome, 00185 Italy
- Maria Patrizia Somma
- Istituto di Biologia e Patologia Molecolari (IBPM) del CNR and Dipartimento di Biologia e Biotecnologie, Sapienza, Università di Roma, 5 Aldo Moro Piazzale, Rome, 00185 Italy
- Lucy Cherbas
- Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, IN 47405 USA
- Gemma May
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, CT 06030 USA
- Department of Biological Sciences, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
- Sara Powell
- Department of Pharmacology and Cancer Biology, Duke University Medical Center, Levine Science Research Center, 308 Research Drive, Durham, NC 27708 USA
- Dayu Zhang
- Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, IN 47405 USA
- School of Agricultural and Food Science, Zhejiang A&F University, 88 Huan Cheng Bei Road, Lin’an, Zhejiang 311300 China
- Lijun Zhan
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, CT 06030 USA
- Alissa Resch
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, CT 06030 USA
- Justen Andrews
- Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, IN 47405 USA
- Susan E Celniker
- Department of Genome Dynamics, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720 USA
- Peter Cherbas
- Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, IN 47405 USA
- Teresa M Przytycka
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20892 USA
- Maurizio Gatti
- Istituto di Biologia e Patologia Molecolari (IBPM) del CNR and Dipartimento di Biologia e Biotecnologie, Sapienza, Università di Roma, 5 Aldo Moro Piazzale, Rome, 00185 Italy
- Brian Oliver
- National Institute of Diabetes, Digestive, and Kidney Diseases, National Institutes of Health, 50 South Drive, Bethesda, MD 20892 USA
- Brenton Graveley
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, CT 06030 USA
- David MacAlpine
- Department of Pharmacology and Cancer Biology, Duke University Medical Center, Levine Science Research Center, 308 Research Drive, Durham, NC 27708 USA
308
Li W, Freudenberg J. Characterizing regions in the human genome unmappable by next-generation-sequencing at the read length of 1000 bases. Comput Biol Chem 2014; 53 Pt A:108-17. [PMID: 25241312; DOI: 10.1016/j.compbiolchem.2014.08.015] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Accepted: 07/11/2014] [Indexed: 12/31/2022]
Abstract
Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: hypothetical reads with length 1 kb which cannot be uniquely mapped with zero-mismatch alignment for the described regions, considering both the forward and reverse strand. The respective collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplication, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplication or transposable elements and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements, very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project.
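The definition of unmappability used in this abstract (a read that cannot be placed uniquely with zero mismatches on either strand) can be sketched on a toy sequence in Python. This is a brute-force illustration with hypothetical function names, not the authors' method, and it would not scale to 1 kb reads on a full human genome.

```python
from collections import Counter


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def non_unique_read_starts(genome, read_length):
    """Start positions whose exact read also occurs elsewhere on either strand.

    A read that equals its own reverse complement is also reported, because
    it matches both strands at the same locus.
    """
    n_reads = len(genome) - read_length + 1
    counts = Counter()
    # Count every read-length substring on the forward and reverse strands.
    for strand in (genome, reverse_complement(genome)):
        for i in range(n_reads):
            counts[strand[i:i + read_length]] += 1
    return [i for i in range(n_reads) if counts[genome[i:i + read_length]] > 1]
```

On the toy sequence `"AACCAGGTT"` with 4-base reads, positions 0 and 5 are reported because `AACC` and `GGTT` are reverse complements of each other.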
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA
- Jan Freudenberg
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA
309
Abstract
MOTIVATION Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. However, because single-cell experiments begin with such small amounts of starting material, the amount of information obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol employed and to variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material. RESULTS We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries. AVAILABILITY AND IMPLEMENTATION The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq. CONTACT andrewds@usc.edu SUPPLEMENTARY INFORMATION Supplementary material is available at Bioinformatics online.
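One textbook example of a non-parametric empirical Bayes Poisson estimator is the classical Good-Toulmin series, sketched below in Python. It is shown only to illustrate the kind of extrapolation the abstract describes and is not the preseq implementation. The input format (a histogram mapping j to the number of genomic positions covered exactly j times in the shallow run) and the function name are assumptions.

```python
def good_toulmin_new_positions(coverage_hist, t):
    """Expected number of newly covered positions after t-fold additional sequencing.

    coverage_hist maps j (>= 1) to the number of positions covered exactly
    j times in the initial experiment. The alternating series is known to be
    unstable for t > 1, so this sketch should only be used for 0 <= t <= 1.
    """
    return sum(
        (-1) ** (j + 1) * (t ** j) * n_j
        for j, n_j in coverage_hist.items()
    )
```

For instance, with 100 positions covered once and 20 covered twice, doubling the sequencing effort (`t = 1`) gives an estimate of 100 - 20 = 80 newly covered positions.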
Affiliation(s)
- Timothy Daley
- Department of Mathematics and Department of Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Andrew D Smith
- Department of Mathematics and Department of Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
310
Meynert AM, Ansari M, FitzPatrick DR, Taylor MS. Variant detection sensitivity and biases in whole genome and exome sequencing. BMC Bioinformatics 2014; 15:247. [PMID: 25038816; PMCID: PMC4122774; DOI: 10.1186/1471-2105-15-247] [Citation(s) in RCA: 147] [Impact Index Per Article: 14.7] [Received: 02/07/2014] [Accepted: 07/07/2014] [Indexed: 12/30/2022] [Open access]
Abstract
Background Less than two percent of the human genome is protein coding, yet that small fraction harbours the majority of known disease causing mutations. Despite rapidly falling whole genome sequencing (WGS) costs, much research and increasingly the clinical use of sequence data is likely to remain focused on the protein coding exome. We set out to quantify and understand how WGS compares with the targeted capture and sequencing of the exome (exome-seq), for the specific purpose of identifying single nucleotide polymorphisms (SNPs) in exome targeted regions. Results We have compared polymorphism detection sensitivity and systematic biases using a set of tissue samples that have been subject to both deep exome and whole genome sequencing. The scoring of detection sensitivity was based on sequence down sampling and reference to a set of gold-standard SNP calls for each sample. Despite evidence of incremental improvements in exome capture technology over time, whole genome sequencing has greater uniformity of sequence read coverage and reduced biases in the detection of non-reference alleles than exome-seq. Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads. Known disease causing mutations are not biased towards easy or hard to sequence areas of the genome for either exome-seq or WGS. Conclusions From an economic perspective, WGS is at parity with exome-seq for variant detection in the targeted coding regions. WGS offers benefits in uniformity of read coverage and more balanced allele ratio calls, both of which can in most cases be offset by deeper exome-seq, with the caveat that some exome-seq targets will never achieve sufficient mapped read depth for variant detection due to technical difficulties or probe failures. As WGS is intrinsically richer data that can provide insight into polymorphisms outside coding regions and reveal genomic rearrangements, it is likely to progressively replace exome-seq for many applications. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-247) contains supplementary material, which is available to authorized users.
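The dependence of SNP detection sensitivity on read depth can be illustrated with an idealized Python model. It is not the authors' scoring procedure, which relied on down-sampling real data against gold-standard calls. The sketch assumes that depth at a site is Poisson-distributed around the mean, that each read at a heterozygous site carries the non-reference allele with probability 0.5, and that a call requires a hypothetical minimum number of non-reference reads. Under these assumptions the non-reference read count is itself Poisson with mean `mean_depth * alt_fraction`.

```python
import math


def het_detection_probability(mean_depth, min_alt_reads=3, alt_fraction=0.5):
    """Probability of seeing at least min_alt_reads non-reference reads.

    min_alt_reads is an assumed calling threshold, chosen for illustration.
    """
    lam = mean_depth * alt_fraction
    # Poisson probability of observing fewer than min_alt_reads alt reads.
    p_too_few = sum(
        math.exp(-lam) * lam ** i / math.factorial(i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_too_few
```

Real coverage is more variable than Poisson and allele ratios can be skewed, so this idealized curve gives an upper bound on sensitivity. The less uniform the coverage and the more biased the allele ratio, the higher the mean depth needed to reach a given sensitivity, which is in line with the exome-seq versus WGS contrast reported above.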
Affiliation(s)
- Alison M Meynert
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Crewe Road, EH4 2XU Edinburgh, UK.
311
Han L, Yuan Y, Zheng S, Yang Y, Li J, Edgerton ME, Diao L, Xu Y, Verhaak RGW, Liang H. The Pan-Cancer analysis of pseudogene expression reveals biologically and clinically relevant tumour subtypes. Nat Commun 2014; 5:3963. [PMID: 24999802; PMCID: PMC4339277; DOI: 10.1038/ncomms4963] [Citation(s) in RCA: 109] [Impact Index Per Article: 10.9] [Received: 02/12/2014] [Accepted: 06/13/2014] [Indexed: 12/30/2022] [Open access]
Abstract
Although individual pseudogenes have been implicated in tumor biology, the biomedical significance and clinical relevance of pseudogene expression have not been assessed in a systematic way. Here we generate pseudogene expression profiles in 2,808 patient samples of seven cancer types from The Cancer Genome Atlas RNA-seq data using a newly developed computational pipeline. Supervised analysis reveals a significant number of pseudogenes differentially expressed among established tumor subtypes; and pseudogene expression alone can accurately classify the major histological subtypes of endometrial cancer. Across cancer types, the tumor subtypes revealed by pseudogene expression show extensive and strong concordance with the subtypes defined by other molecular data. Strikingly, in kidney cancer, the pseudogene-expression subtypes not only significantly correlate with patient survival, but also help stratify patients in combination with clinical variables. Our study highlights the potential of pseudogene expression analysis as a new paradigm for investigating cancer mechanisms and discovering prognostic biomarkers.
Affiliation(s)
- Leng Han
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Yuan Yuan
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX 77030, USA
- Siyuan Zheng
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Yang Yang
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Division of Biostatistics, The University of Texas Health Science Center at Houston, School of Public Health, Houston, TX 77030, USA
- Jun Li
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Mary E Edgerton
- Department of Pathology, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Houston, TX 77030, USA
- Lixia Diao
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Yanxun Xu
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Roeland G W Verhaak
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Han Liang
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, USA
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX 77030, USA
|
312
|
Nevado B, Perez-Enciso M. Pipeliner: software to evaluate the performance of bioinformatics pipelines for next-generation resequencing. Mol Ecol Resour 2014; 15:99-106. [PMID: 24890372 DOI: 10.1111/1755-0998.12286] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/15/2014] [Revised: 05/19/2014] [Accepted: 05/23/2014] [Indexed: 12/30/2022]
Abstract
The choice of technology and bioinformatics approach is critical in obtaining accurate and reliable information from next-generation sequencing (NGS) experiments. An increasing number of software and methodological guidelines are being published, but deciding upon which approach and experimental design to use can depend on the particularities of the species and on the aims of the study. This leaves researchers unable to produce informed decisions on these central questions. To address these issues, we developed pipeliner - a tool to evaluate, by simulation, the performance of NGS pipelines in resequencing studies. Pipeliner provides a graphical interface allowing the users to write and test their own bioinformatics pipelines with publicly available or custom software. It computes a number of statistics summarizing the performance in SNP calling, including the recovery, sensitivity and false discovery rate for heterozygous and homozygous SNP genotypes. Pipeliner can be used to answer many practical questions, for example, for a limited amount of NGS effort, how many more reliable SNPs can be detected by doubling coverage and halving sample size or what is the false discovery rate provided by different SNP calling algorithms and options. Pipeliner thus allows researchers to carefully plan their study's sampling design and compare the suitability of alternative bioinformatics approaches for their specific study systems. Pipeliner is written in C++ and is freely available from http://github.com/brunonevado/Pipeliner.
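The performance statistics named in this abstract can be illustrated with a minimal sketch. It assumes the textbook definitions of sensitivity and false discovery rate over sets of SNP positions; the function name, the toy positions, and the exact definitions are illustrative assumptions and are not taken from Pipeliner itself.

```python
def snp_calling_stats(true_snps, called_snps):
    """Return (sensitivity, false_discovery_rate) for called SNP positions.

    Both arguments are iterables of hashable SNP identifiers, here
    (chromosome, position) tuples. Definitions are the standard ones and
    may differ from those used by any particular tool.
    """
    true_snps, called_snps = set(true_snps), set(called_snps)
    true_positives = len(true_snps & called_snps)
    false_positives = len(called_snps - true_snps)
    # Sensitivity: fraction of simulated (true) SNPs that were called.
    sensitivity = true_positives / len(true_snps) if true_snps else float("nan")
    # FDR: fraction of called SNPs that are not true SNPs.
    fdr = false_positives / len(called_snps) if called_snps else float("nan")
    return sensitivity, fdr


# Hypothetical toy data: four simulated SNPs, three calls, one of them wrong.
simulated = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900)}
called = {("chr1", 100), ("chr1", 250), ("chr2", 41)}
sens, fdr = snp_calling_stats(simulated, called)
print(f"sensitivity={sens:.2f} FDR={fdr:.2f}")
```

Because the truth set is known only in simulation, statistics of this kind are what make a simulation-based comparison of pipelines possible.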
Affiliation(s)
- B Nevado
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, 08193, Bellaterra, Spain; Universitat Autònoma de Barcelona, 08193, Bellaterra, Spain
|
313
|
Ilut DC, Nydam ML, Hare MP. Defining loci in restriction-based reduced representation genomic data from nonmodel species: sources of bias and diagnostics for optimal clustering. Biomed Res Int 2014; 2014:675158. [PMID: 25057498 PMCID: PMC4095725 DOI: 10.1155/2014/675158] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Received: 02/28/2014] [Revised: 06/02/2014] [Accepted: 06/04/2014] [Indexed: 01/15/2023]
Abstract
Next generation sequencing holds great promise for applications of phylogeography, landscape genetics, and population genomics in wild populations of nonmodel species, but the robustness of inferences hinges on careful experimental design and effective bioinformatic removal of predictable artifacts. Addressing this issue, we use published genomes from a tunicate, stickleback, and soybean to illustrate the potential for bioinformatic artifacts and introduce a protocol to minimize two sources of error expected from similarity-based de-novo clustering of stacked reads: the splitting of alleles into different clusters, which creates false homozygosity, and the grouping of paralogs into the same cluster, which creates false heterozygosity. We present an empirical application focused on Ciona savignyi, a tunicate with very high SNP heterozygosity (~0.05), because high diversity challenges the computational efficiency of most existing nonmodel pipelines while also potentially exacerbating paralog artifacts. The simulated and empirical data illustrate the advantages of using higher sequence difference clustering thresholds than is typical and demonstrate the utility of our protocol for efficiently identifying an optimum threshold from data without prior knowledge of heterozygosity. The empirical Ciona savignyi data also highlight null alleles as a potentially large source of false homozygosity in restriction-based reduced representation genomic data.
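The effect of the clustering threshold on allele splitting can be sketched with a toy example. The greedy seed-based clustering below is a deliberately simple stand-in, not the protocol described in the paper; the function names, the 20 bp reads, and the thresholds are all hypothetical, and equal-length reads are assumed.

```python
def mismatch_fraction(a, b):
    """Fraction of positions at which two equal-length reads differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold):
    """Assign each read to the first cluster whose seed is within threshold."""
    clusters = []  # each cluster is a list; its first member is the seed
    for read in reads:
        for cluster in clusters:
            if mismatch_fraction(read, cluster[0]) <= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


# Two alleles of one locus that differ at 2 of 20 sites (10% divergence).
allele_1 = "ACGTACGTACGTACGTACGT"
allele_2 = "ACGTTCGTACGTACGAACGT"
reads = [allele_1, allele_2, allele_1, allele_2]

strict = greedy_cluster(reads, threshold=0.05)  # alleles split apart
loose = greedy_cluster(reads, threshold=0.15)   # alleles kept together
print(len(strict), len(loose))
```

With the strict threshold each cluster contains a single allele and would be scored as homozygous, which is the false-homozygosity artifact described above. Raising the threshold too far would instead merge paralogs, which this toy example does not model.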
Affiliation(s)
- Daniel C. Ilut
- Department of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14850, USA
- Marie L. Nydam
- Division of Science and Mathematics, Centre College, Danville, KY 40422, USA
- Matthew P. Hare
- Department of Natural Resources, Cornell University, Ithaca, NY 14850, USA
|
314
|
Halimaa P, Blande D, Aarts MGM, Tuomainen M, Tervahauta A, Kärenlampi S. Comparative transcriptome analysis of the metal hyperaccumulator Noccaea caerulescens. Front Plant Sci 2014; 5:213. [PMID: 24904610 PMCID: PMC4033236 DOI: 10.3389/fpls.2014.00213] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 06/20/2013] [Accepted: 04/30/2014] [Indexed: 05/20/2023]
Abstract
The metal hyperaccumulator Noccaea caerulescens is an established model to study the adaptation of plants to metalliferous soils. Various comparators have been used in these studies. The choice of suitable comparators is important and depends on the hypothesis to be tested and methods to be used. In high-throughput analyses such as microarray, N. caerulescens has been compared to non-tolerant, non-accumulator plants like Arabidopsis thaliana or Thlaspi arvense rather than to the related hypertolerant or hyperaccumulator plants. An underutilized source is N. caerulescens populations with considerable variation in their capacity to accumulate and tolerate metals. Whole transcriptome sequencing (RNA-Seq) is revealing interesting variation in their gene expression profiles. Combining physiological characteristics of N. caerulescens accessions with their RNA-Seq has a great potential to provide detailed insight into the underlying molecular mechanisms, including entirely new gene products. In this review we will critically consider comparative transcriptome analyses carried out to explore metal hyperaccumulation and hypertolerance of N. caerulescens, and demonstrate the potential of RNA-Seq analysis as a tool in evolutionary genomics.
Affiliation(s)
- Pauliina Halimaa
- Department of Biology, University of Eastern Finland, Kuopio, Finland
- Daniel Blande
- Department of Biology, University of Eastern Finland, Kuopio, Finland
- Mark G. M. Aarts
- Laboratory of Genetics, Wageningen University, Wageningen, Netherlands
- Marjo Tuomainen
- Department of Biology, University of Eastern Finland, Kuopio, Finland
- Arja Tervahauta
- Department of Biology, University of Eastern Finland, Kuopio, Finland
- Sirpa Kärenlampi
- Department of Biology, University of Eastern Finland, Kuopio, Finland
|
315
|
Abel HJ, Al-Kateb H, Cottrell CE, Bredemeyer AJ, Pritchard CC, Grossmann AH, Wallander ML, Pfeifer JD, Lockwood CM, Duncavage EJ. Detection of gene rearrangements in targeted clinical next-generation sequencing. J Mol Diagn 2014; 16:405-17. [PMID: 24813172 DOI: 10.1016/j.jmoldx.2014.03.006] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Received: 11/04/2013] [Revised: 02/24/2014] [Accepted: 03/06/2014] [Indexed: 12/30/2022] Open
Abstract
The identification of recurrent gene rearrangements in the clinical laboratory is the cornerstone for risk stratification and treatment decisions in many malignant tumors. Studies have reported that targeted next-generation sequencing assays have the potential to identify such rearrangements; however, their utility in the clinical laboratory is unknown. We examine the sensitivity and specificity of ALK and KMT2A (MLL) rearrangement detection by next-generation sequencing in the clinical laboratory. We analyzed a series of seven ALK rearranged cancers, six KMT2A rearranged leukemias, and 77 ALK/KMT2A rearrangement-negative cancers, previously tested by fluorescence in situ hybridization (FISH). Rearrangement detection was tested using publicly available software tools, including Breakdancer, ClusterFAST, CREST, and Hydra. Using Breakdancer and ClusterFAST, we detected ALK rearrangements in seven of seven FISH-positive cases and KMT2A rearrangements in six of six FISH-positive cases. Among the 77 ALK/KMT2A FISH-negative cases, no false-positive identifications were made by Breakdancer or ClusterFAST. Further, we identified one ALK rearranged case with a noncanonical intron 16 breakpoint, which is likely to affect its response to targeted inhibitors. We report that clinically relevant chromosomal rearrangements can be detected from targeted gene panel-based next-generation sequencing with sensitivity and specificity equivalent to that of FISH while providing finer-scale information and increased efficiency for molecular oncology testing.
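As a worked check of the figures reported for Breakdancer and ClusterFAST, the counts in the abstract can be placed in the standard definitions of sensitivity and specificity, with FISH taken as the reference. Pooling the ALK and KMT2A cases into a single calculation is an assumption made here for illustration; the abstract reports them separately.

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{7 + 6}{(7 + 6) + 0} = 1,
\qquad
\text{specificity} = \frac{TN}{TN + FP} = \frac{77}{77 + 0} = 1
\]
```

With only 13 positive cases, these point estimates carry wide uncertainty, so equivalence to FISH holds within the limits of this sample size.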
Affiliation(s)
- Haley J Abel
- Department of Genetics, Washington University, St. Louis, Missouri
- Hussam Al-Kateb
- Department of Pathology and Immunology, Washington University, St. Louis, Missouri
- Catherine E Cottrell
- Department of Pathology and Immunology, Washington University, St. Louis, Missouri
- Andrew J Bredemeyer
- Department of Pathology and Immunology, Washington University, St. Louis, Missouri
- Colin C Pritchard
- Department of Laboratory Medicine, University of Washington, Seattle, Washington
- Allie H Grossmann
- Department of Pathology, University of Utah and ARUP Laboratories, Salt Lake City, Utah
- John D Pfeifer
- Department of Pathology and Immunology, Washington University, St. Louis, Missouri
- Christina M Lockwood
- Department of Pathology and Immunology, Washington University, St. Louis, Missouri
- Eric J Duncavage
- Department of Pathology and Immunology, Washington University, St. Louis, Missouri
|
316
|
Zhu Y, Li M, Sousa AMM, Sestan N. XSAnno: a framework for building ortholog models in cross-species transcriptome comparisons. BMC Genomics 2014; 15:343. [PMID: 24884593 PMCID: PMC4035071 DOI: 10.1186/1471-2164-15-343] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 01/17/2014] [Accepted: 04/24/2014] [Indexed: 02/04/2023] Open
Abstract
Background: The accurate characterization of RNA transcripts and expression levels across species is critical for understanding transcriptome evolution. As available RNA-seq data accumulate rapidly, there is a great demand for tools that build gene annotations for cross-species RNA-seq analysis. However, prevailing methods of ortholog annotation for RNA-seq analysis between closely-related species do not take inter-species variation in mappability into consideration. Results: Here we present XSAnno, a computational framework that integrates previous approaches with multiple filters to improve the accuracy of inter-species transcriptome comparisons. The implementation of this approach in comparing RNA-seq data of human, chimpanzee, and rhesus macaque brain transcriptomes has reduced the false discovery of differentially expressed genes, while maintaining a low false negative rate. Conclusion: The present study demonstrates the utility of the XSAnno pipeline in building ortholog annotations and improving the accuracy of cross-species transcriptome comparisons. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-343) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Nenad Sestan
- Department of Neurobiology, Kavli Institute for Neuroscience, Yale School of Medicine, 06510 New Haven, CT, USA
|
317
|
Supek F, Miñana B, Valcárcel J, Gabaldón T, Lehner B. Synonymous Mutations Frequently Act as Driver Mutations in Human Cancers. Cell 2014; 156:1324-1335. [DOI: 10.1016/j.cell.2014.01.051] [Citation(s) in RCA: 331] [Impact Index Per Article: 33.1] [Received: 05/01/2013] [Revised: 11/20/2013] [Accepted: 01/15/2014] [Indexed: 01/05/2023]
|
318
|
Investigating and correcting plasma DNA sequencing coverage bias to enhance aneuploidy discovery. PLoS One 2014; 9:e86993. [PMID: 24489824 PMCID: PMC3906086 DOI: 10.1371/journal.pone.0086993] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 09/03/2013] [Accepted: 12/16/2013] [Indexed: 12/11/2022] Open
Abstract
Pregnant women carry a mixture of cell-free DNA fragments from self and fetus (non-self) in their circulation. In recent years multiple independent studies have demonstrated the ability to detect fetal trisomies such as trisomy 21, the cause of Down syndrome, by Next-Generation Sequencing of maternal plasma. The current clinical tests based on this approach show very high sensitivity and specificity, although as yet they have not become the standard diagnostic test. Here we describe improvements to the analysis of the sequencing data by reducing GC bias and better handling of the genomic repeats. We show substantial improvements in the sensitivity of the standard trisomy 21 statistical tests, which we measure by artificially reducing read coverage. We also explore the bias stemming from the natural cleavage of plasma DNA by examining DNA motifs and position specific base distributions. We propose a model to correct this fragmentation bias and observe that incorporating this bias does not lead to any further improvements in the detection of fetal trisomy. The improved bias corrections that we demonstrate in this work can be readily adopted into existing fetal trisomy detection protocols and should also lead to improvements in sub-chromosomal copy number variation detection.
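To make the idea of GC-bias reduction concrete, here is a minimal sketch of one widely used generic approach: group genomic bins by GC fraction and rescale each bin by the ratio of the global median count to the median count of its GC stratum. This is not the correction developed in the paper; the function name, the number of strata, and the toy counts are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, n_strata=10):
    """Rescale per-bin read counts so every GC stratum shares one median."""
    # Assign each bin to a GC stratum (0 .. n_strata - 1).
    strata = [min(int(gc * n_strata), n_strata - 1) for gc in gc_fractions]
    by_stratum = defaultdict(list)
    for count, stratum in zip(counts, strata):
        by_stratum[stratum].append(count)
    global_median = median(counts)
    stratum_median = {s: median(v) for s, v in by_stratum.items()}
    corrected = []
    for count, stratum in zip(counts, strata):
        m = stratum_median[stratum]
        # Bins in an empty-coverage stratum cannot be rescaled; report 0.
        corrected.append(count * global_median / m if m > 0 else 0.0)
    return corrected


# Hypothetical bins: high-GC bins receive about twice the coverage.
counts = [100, 110, 90, 200, 220, 180]
gc = [0.35, 0.36, 0.38, 0.55, 0.56, 0.58]
corrected = gc_correct(counts, gc)
print(corrected)
```

After correction the low-GC and high-GC bins have the same median, so a chromosome-level count ratio is less affected by differences in base composition between chromosomes.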
|
319
|
Maticzka D, Lange SJ, Costa F, Backofen R. GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biol 2014; 15:R17. [PMID: 24451197 PMCID: PMC4053806 DOI: 10.1186/gb-2014-15-1-r17] [Citation(s) in RCA: 187] [Impact Index Per Article: 18.7] [Received: 08/07/2013] [Accepted: 01/22/2014] [Indexed: 12/01/2022] Open
Abstract
We present GraphProt, a computational framework for learning sequence- and structure-binding preferences of RNA-binding proteins (RBPs) from high-throughput experimental data. We benchmark GraphProt, demonstrating that the modeled binding preferences conform to the literature, and showcase the biological relevance and two applications of GraphProt models. First, estimated binding affinities correlate with experimental measurements. Second, predicted Ago2 targets display higher levels of expression upon Ago2 knockdown, whereas control targets do not. Computational binding models, such as those provided by GraphProt, are essential for predicting RBP binding sites and affinities in all tissues. GraphProt is freely available at http://www.bioinf.uni-freiburg.de/Software/GraphProt.
|
320
|
|
321
|
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion: Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
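The quantity at the centre of this abstract, the proportion of non-singleton k-mers as a function of k, can be sketched on a toy sequence. The version below counts, per position, whether the k-mer starting there occurs more than once on the given strand; whether the paper counts by position or by distinct k-mer, and how it treats the reverse strand, are not stated here, so those choices are assumptions, as are the function name and the toy sequence.

```python
from collections import Counter


def non_singleton_fraction(sequence, k):
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)


# Hypothetical 16 bp sequence containing a short repeat.
toy = "ACGTACGTTTGACGTA"
for k in (4, 8):
    print(k, round(non_singleton_fraction(toy, k), 3))
```

On a real genome the k-mer table would not fit in a Python list; a disk-backed or dedicated k-mer counter would be needed, but the definition being evaluated is the same.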
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA
|
322
|
Ozer HG, Usubalieva A, Dorrance A, Yilmaz AS, Caligiuri M, Marcucci G, Huang K. Identification of medium-sized copy number alterations in whole-genome sequencing. Cancer Inform 2014; 13:105-11. [PMID: 25788829 PMCID: PMC4356486 DOI: 10.4137/cin.s14023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2014] [Revised: 12/29/2014] [Accepted: 01/04/2015] [Indexed: 11/05/2022] Open
Abstract
Genome-wide discoveries such as the detection of copy number alterations (CNAs) from high-throughput whole-genome sequencing data have enabled new developments in personalized medicine. CNAs have been reported to be associated with various diseases and cancers including acute myeloid leukemia. However, there are multiple challenges to the use of current CNA detection tools that lead to high false-positive rates and thus impede widespread use of such tools in cancer research. In this paper, we discuss these issues and propose possible solutions. First, since the entire genome cannot be mapped due to some regions lacking sequence uniqueness, current methods cannot be appropriately adjusted to handle these regions in the analyses. Thus, detection of medium-sized CNAs is also being directly affected by these mappability problems. The requirement for matching control samples is also an important limitation because acquiring matching controls might not be possible or might not be cost efficient. Here we present an approach that addresses these issues and detects medium-sized CNAs in cancer genomes by (1) masking unmappable regions during the initial CNA detection phase, (2) using a pool of a few normal samples as control, and (3) employing median filtering to adjust CNA ratios to their surrounding coverage and eliminate false positives.
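Step (3), median filtering of copy-number ratios, can be illustrated with a short sketch. It shows a plain sliding-window median over per-bin tumour/normal ratios; the window sizes, the edge handling, and the toy ratios are assumptions, and the paper's exact filter may differ.

```python
from statistics import median


def median_filter(values, window=5):
    """Replace each value by the median of a centred window.

    The window is truncated at the ends of the list rather than padded.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# A single-bin spike is treated as noise and removed ...
spike = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]
smoothed_spike = median_filter(spike, window=5)

# ... while a gain spanning several consecutive bins is preserved.
gain = [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0]
smoothed_gain = median_filter(gain, window=3)
print(smoothed_spike, smoothed_gain)
```

The window width sets the smallest alteration that survives filtering, so it has to be chosen with the target CNA size and the bin size in mind.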
Affiliation(s)
- Hatice Gulcin Ozer
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Aisulu Usubalieva
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Adrienne Dorrance
- Division of Hematology, Department of Medicine, The Ohio State University, Columbus, OH, USA
- Ayse Selen Yilmaz
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Michael Caligiuri
- Division of Hematology, Department of Medicine, The Ohio State University, Columbus, OH, USA
- Guido Marcucci
- Division of Hematology, Department of Medicine, The Ohio State University, Columbus, OH, USA
- Kun Huang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
|
323
|
Whole-exome sequencing in splenic marginal zone lymphoma reveals mutations in genes involved in marginal zone differentiation. Leukemia 2013; 28:1334-40. [PMID: 24296945 DOI: 10.1038/leu.2013.365] [Citation(s) in RCA: 78] [Impact Index Per Article: 7.1] [Received: 08/13/2013] [Revised: 11/21/2013] [Accepted: 11/25/2013] [Indexed: 01/12/2023]
Abstract
Splenic marginal zone lymphoma (SMZL) is a B-cell neoplasm whose molecular pathogenesis remains fundamentally unexplained, requiring more precise diagnostic markers. Previous molecular studies have revealed 7q loss and mutations of nuclear factor κB (NF-κB), B-cell receptor (BCR) and Notch signalling genes. We performed whole-exome sequencing in a series of SMZL cases. Results confirmed that SMZL is an entity distinct from other low-grade B-cell lymphomas, and identified mutations in multiple genes involved in marginal zone development, and others involved in NF-κB, BCR, chromatin remodelling and the cytoskeleton.
|
324
|
Bacolla A, Temiz NA, Yi M, Ivanic J, Cer RZ, Donohue DE, Ball EV, Mudunuri US, Wang G, Jain A, Volfovsky N, Luke BT, Stephens RM, Cooper DN, Collins JR, Vasquez KM. Guanine holes are prominent targets for mutation in cancer and inherited disease. PLoS Genet 2013; 9:e1003816. [PMID: 24086153 PMCID: PMC3784513 DOI: 10.1371/journal.pgen.1003816] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 03/04/2013] [Accepted: 08/07/2013] [Indexed: 12/27/2022] Open
Abstract
Single base substitutions constitute the most frequent type of human gene mutation and are a leading cause of cancer and inherited disease. These alterations occur non-randomly in DNA, being strongly influenced by the local nucleotide sequence context. However, the molecular mechanisms underlying such sequence context-dependent mutagenesis are not fully understood. Using bioinformatics, computational and molecular modeling analyses, we have determined the frequencies of mutation at G • C bp in the context of all 64 5'-NGNN-3' motifs that contain the mutation at the second position. Twenty-four datasets were employed, comprising >530,000 somatic single base substitutions from 21 cancer genomes, >77,000 germline single-base substitutions causing or associated with human inherited disease and 16.7 million benign germline single-nucleotide variants. In several cancer types, the number of mutated motifs correlated both with the free energies of base stacking and the energies required for abstracting an electron from the target guanines (ionization potentials). Similar correlations were also evident for the pathological missense and nonsense germline mutations, but only when the target guanines were located on the non-transcribed DNA strand. Likewise, pathogenic splicing mutations predominantly affected positions in which a purine was located on the non-transcribed DNA strand. Novel candidate driver mutations and tissue-specific mutational patterns were also identified in the cancer datasets. We conclude that electron transfer reactions within the DNA molecule contribute to sequence context-dependent mutagenesis, involving both somatic driver and passenger mutations in cancer, as well as germline alterations causing or associated with inherited disease.
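The motif space used in this analysis is small enough to enumerate directly: fixing G at the second position of a four-base window leaves three free positions, giving 4^3 = 64 motifs. The sketch below enumerates them and tallies occurrences in a sequence. It scans only the strand given and does not normalise by mutation counts, both of which are simplifications; the function name and the toy sequence are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# All 5'-NGNN-3' motifs: any base, then G, then any two bases.
MOTIFS = ["".join((n1, "G", n3, n4)) for n1, n3, n4 in product(BASES, repeat=3)]


def count_ngnn(sequence):
    """Count occurrences of each NGNN motif on the given strand."""
    counts = {motif: 0 for motif in MOTIFS}
    seq = sequence.upper()
    # i indexes the G; one base of 5' context and two of 3' context are needed.
    for i in range(1, len(seq) - 2):
        if seq[i] == "G":
            motif = seq[i - 1:i + 3]
            if motif in counts:  # skips windows containing N or other symbols
                counts[motif] += 1
    return counts


toy_counts = count_ngnn("AGCTTGAAGG")
print(len(MOTIFS), {m: c for m, c in toy_counts.items() if c})
```

Dividing the number of mutated guanines observed in each motif by a background count of this kind would give a per-motif mutation frequency, which is the type of quantity the abstract relates to base-stacking energies and ionization potentials.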
Affiliation(s)
- Albino Bacolla
- Division of Pharmacology and Toxicology, The University of Texas at Austin, Dell Pediatric Research Institute, Austin, Texas, United States of America
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Nuri A. Temiz
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Ming Yi
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Joseph Ivanic
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Regina Z. Cer
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Duncan E. Donohue
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Edward V. Ball
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, United Kingdom
- Uma S. Mudunuri
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Guliang Wang
- Division of Pharmacology and Toxicology, The University of Texas at Austin, Dell Pediatric Research Institute, Austin, Texas, United States of America
- Aklank Jain
- Division of Pharmacology and Toxicology, The University of Texas at Austin, Dell Pediatric Research Institute, Austin, Texas, United States of America
- Natalia Volfovsky
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Brian T. Luke
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Robert M. Stephens
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- David N. Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, United Kingdom
- Jack R. Collins
- Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America
- Karen M. Vasquez
- Division of Pharmacology and Toxicology, The University of Texas at Austin, Dell Pediatric Research Institute, Austin, Texas, United States of America
|
325
|
Cabanski CR, Wilkerson MD, Soloway M, Parker JS, Liu J, Prins JF, Marron JS, Perou CM, Hayes DN. BlackOPs: increasing confidence in variant detection through mappability filtering. Nucleic Acids Res 2013; 41:e178. [PMID: 23935067 PMCID: PMC3799449 DOI: 10.1093/nar/gkt692] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 01/05/2023] Open
Abstract
Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin ('mismapping') and inferring differences in mismapped sequences to be true variants. We developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. We show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. We queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. We demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.
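The final filtering step described here, removing called variants that match a blacklist of positions and alleles, amounts to a set-membership test. The sketch below assumes variants are dictionaries with `chrom`, `pos`, and `alt` keys and that the blacklist is a collection of `(chrom, pos, alt)` tuples; these structures and the toy records are hypothetical and do not reflect the actual BlackOPs file format.

```python
def filter_blacklisted(variants, blacklist):
    """Return the variants whose (chrom, pos, alt) is not blacklisted."""
    blacklist = set(blacklist)  # O(1) membership tests
    return [
        v for v in variants
        if (v["chrom"], v["pos"], v["alt"]) not in blacklist
    ]


# Hypothetical blacklist for one aligner and read length.
blacklist = {("chr7", 55_000_123, "T"), ("chr12", 1_234_567, "A")}

calls = [
    {"chrom": "chr7", "pos": 55_000_123, "alt": "T"},   # mismapping artifact
    {"chrom": "chr7", "pos": 55_000_123, "alt": "G"},   # same site, other allele
    {"chrom": "chr17", "pos": 7_577_120, "alt": "A"},
]
kept = filter_blacklisted(calls, blacklist)
print(len(calls), "->", len(kept))
```

Matching on the allele as well as the position keeps a genuine variant at a blacklisted site when its alternate allele differs from the artifact. Because the abstract notes that blacklists are specific to the aligner and read length, a blacklist generated for one setup should not be reused for another.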
Affiliation(s)
- Christopher R Cabanski
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA, The Genome Institute at Washington University, St. Louis, MO 63108, USA, Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA, Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA, Department of Computer Science, University of Kentucky, Lexington, KY 40506, USA, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA and Division of Medical Oncology, Department of Internal Medicine, University of North Carolina, Chapel Hill, NC 27599, USA
|
326
|
Stevenson KR, Coolon JD, Wittkopp PJ. Sources of bias in measures of allele-specific expression derived from RNA-sequence data aligned to a single reference genome. BMC Genomics 2013; 14:536. [PMID: 23919664 PMCID: PMC3751238 DOI: 10.1186/1471-2164-14-536] [Citation(s) in RCA: 91] [Impact Index Per Article: 8.3] [Received: 02/14/2013] [Accepted: 08/05/2013] [Indexed: 11/23/2022] Open
Abstract
Background: RNA-seq can be used to measure allele-specific expression (ASE) by assigning sequence reads to individual alleles; however, relative ASE is systematically biased when sequence reads are aligned to a single reference genome. Aligning sequence reads to both parental genomes can eliminate this bias, but this approach is not always practical, especially for non-model organisms. To improve accuracy of ASE measured using a single reference genome, we identified properties of differentiating sites responsible for biased measures of relative ASE. Results: We found that clusters of differentiating sites prevented sequence reads from an alternate allele from aligning to the reference genome, causing a bias in relative ASE favoring the reference allele. This bias increased with greater sequence divergence between alleles. Increasing the number of mismatches allowed when aligning sequence reads to the reference genome and restricting analysis to genomic regions with fewer differentiating sites than the number of mismatches allowed almost completely eliminated this systematic bias. Accuracy of allelic abundance was increased further by excluding differentiating sites within sequence reads that could not be aligned uniquely within the genome (imperfect mappability) and reads that overlapped one or more insertions or deletions (indels) between alleles. Conclusions: After aligning sequence reads to a single reference genome, excluding differentiating sites with at least as many neighboring differentiating sites as the number of mismatches allowed, imperfect mappability, and/or an indel(s) nearby resulted in measures of allelic abundance comparable to those derived from aligning sequence reads to both parental genomes.
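The first exclusion rule in the conclusions can be sketched as a neighbour count per differentiating site. The abstract does not define "neighboring"; the sketch assumes a neighbour is any other differentiating site within `read_length - 1` bp, meaning close enough to fall in the same read. That window, the function name, and the toy positions are assumptions, and positions are assumed to be distinct coordinates on one chromosome.

```python
from bisect import bisect_left, bisect_right


def retained_sites(positions, read_length, max_mismatches):
    """Keep sites with fewer neighbouring sites than mismatches allowed."""
    positions = sorted(positions)
    reach = read_length - 1  # farthest a site can be and still share a read
    kept = []
    for pos in positions:
        lo = bisect_left(positions, pos - reach)
        hi = bisect_right(positions, pos + reach)
        neighbours = hi - lo - 1  # sites in the window, excluding pos itself
        # Exclude sites with at least as many neighbours as mismatches allowed.
        if neighbours < max_mismatches:
            kept.append(pos)
    return kept


# Three clustered sites and one isolated site, 36 bp reads.
sites = [100, 110, 120, 500]
kept_strict = retained_sites(sites, read_length=36, max_mismatches=2)
kept_relaxed = retained_sites(sites, read_length=36, max_mismatches=3)
print(kept_strict, kept_relaxed)
```

Allowing more mismatches retains the clustered sites, which mirrors the abstract's observation that a higher mismatch allowance widens the set of regions that can be analysed without reference bias.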
Affiliation(s)
- Kraig R Stevenson
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
|
327
|
Bussotti G, Notredame C, Enright AJ. Detecting and comparing non-coding RNAs in the high-throughput era. Int J Mol Sci 2013; 14:15423-58. [PMID: 23887659 PMCID: PMC3759867 DOI: 10.3390/ijms140815423] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2013] [Revised: 07/16/2013] [Accepted: 07/17/2013] [Indexed: 02/07/2023] Open
Abstract
In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and more generally any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino-acids. Moreover for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies and especially dealing with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data.
Affiliation(s)
- Giovanni Bussotti
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Cedric Notredame
- Bioinformatics and Genomics Program, Centre for Genomic Regulation (CRG), Aiguader, 88, 08003 Barcelona, Spain
| | - Anton J. Enright
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
328
|
DNase I-hypersensitive exons colocalize with promoters and distal regulatory elements. Nat Genet 2013; 45:852-9. [PMID: 23793028 DOI: 10.1038/ng.2677] [Citation(s) in RCA: 92] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2013] [Accepted: 05/30/2013] [Indexed: 12/18/2022]
Abstract
The precise splicing of genes confers an enormous transcriptional complexity to the human genome. The majority of gene splicing occurs cotranscriptionally, permitting epigenetic modifications to affect splicing outcomes. Here we show that select exonic regions are demarcated within the three-dimensional structure of the human genome. We identify a subset of exons that exhibit DNase I hypersensitivity and are accompanied by 'phantom' signals in chromatin immunoprecipitation and sequencing (ChIP-seq) that result from cross-linking with proximal promoter- or enhancer-bound factors. The capture of structural features by ChIP-seq is confirmed by chromatin interaction analysis that resolves local intragenic loops that fold exons close to cognate promoters while excluding intervening intronic sequences. These interactions of exons with promoters and enhancers are enriched for alternative splicing events, an effect reflected in cell type-specific periexonic DNase I hypersensitivity patterns. Collectively, our results connect local genome topography, chromatin structure and cis-regulatory landscapes with the generation of human transcriptional complexity by cotranscriptional splicing.
|
329
|
Hosseini M, Goodstadt L, Hughes JR, Kowalczyk MS, de Gobbi M, Otto GW, Copley RR, Mott R, Higgs DR, Flint J. Causes and consequences of chromatin variation between inbred mice. PLoS Genet 2013; 9:e1003570. [PMID: 23785304 PMCID: PMC3681629 DOI: 10.1371/journal.pgen.1003570] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2012] [Accepted: 05/02/2013] [Indexed: 12/28/2022] Open
Abstract
Variation at regulatory elements, identified through hypersensitivity to digestion by DNase I, is believed to contribute to variation in complex traits, but the extent and consequences of this variation are poorly characterized. Analysis of terminally differentiated erythroblasts in eight inbred strains of mice identified reproducible variation at approximately 6% of DNase I hypersensitive sites (DHS). Only 30% of such variable DHS contain a sequence variant predictive of site variation. Nevertheless, sequence variants within variable DHS are more likely to be associated with complex traits than those in non-variant DHS, and variants associated with complex traits preferentially occur in variable DHS. Changes at a small proportion (less than 10%) of variable DHS are associated with changes in nearby transcriptional activity. Our results show that whilst DNA sequence variation is not the major determinant of variation in open chromatin, where such variants exist they are likely to be causal for complex traits.
Regulatory sites of the genome affect gene expression and complex traits, including disease susceptibility. Variable regulatory sites are potentially interesting because they are a likely cause of phenotypic variation, providing a bridge between sequence and transcriptional variation. In this paper we identify regions of the genome where DNA is not wrapped up in chromatin (hence potentially regulatory) in eight inbred strains of mice. We compare sites that vary among strains and compare them to non-variable sites. We show that more than half of variable sites cannot be attributed to local sequence variation. Functional consequences (in terms of readily detectable changes in gene expression) are associated with less than 10% of variable DNase I hypersensitive sites. We show that variable sites are enriched for sequence variants contributing to complex traits in mice.
Affiliation(s)
- Mona Hosseini
- Wellcome Trust Centre for Human Genetics, Oxford, United Kingdom
| | - Leo Goodstadt
- Wellcome Trust Centre for Human Genetics, Oxford, United Kingdom
| | - Jim R. Hughes
- MRC Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, United Kingdom
| | - Monika S. Kowalczyk
- MRC Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, United Kingdom
| | - Marco de Gobbi
- MRC Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, United Kingdom
| | - Georg W. Otto
- Wellcome Trust Centre for Human Genetics, Oxford, United Kingdom
| | | | - Richard Mott
- Wellcome Trust Centre for Human Genetics, Oxford, United Kingdom
| | - Douglas R. Higgs
- MRC Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, United Kingdom
| | - Jonathan Flint
- Wellcome Trust Centre for Human Genetics, Oxford, United Kingdom
|
330
|
Witherspoon DJ, Zhang Y, Xing J, Watkins WS, Ha H, Batzer MA, Jorde LB. Mobile element scanning (ME-Scan) identifies thousands of novel Alu insertions in diverse human populations. Genome Res 2013; 23:1170-81. [PMID: 23599355 PMCID: PMC3698510 DOI: 10.1101/gr.148973.112] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2023]
Abstract
Alu retrotransposons are the most numerous and active mobile elements in humans, causing genetic disease and creating genomic diversity. Mobile element scanning (ME-Scan) enables comprehensive and affordable identification of mobile element insertions (MEI) using targeted high-throughput sequencing of multiplexed MEI junction libraries. In a single experiment, ME-Scan identifies nearly all AluYb8 and AluYb9 elements, with high sensitivity for both rare and common insertions, in 169 individuals of diverse ancestry. ME-Scan detects heterozygous insertions in single individuals with 91% sensitivity. Insertion presence or absence states determined by ME-Scan are 95% concordant with those determined by locus-specific PCR assays. By sampling diverse populations from Africa, South Asia, and Europe, we are able to identify 5799 Alu insertions, including 2524 novel ones, some of which occur in exons. Sub-Saharan populations and a Pygmy group in particular carry numerous intermediate-frequency Alu insertions that are absent in non-African groups. There is a significant dearth of exon-interrupting insertions among common Alu polymorphisms, but the density of singleton Alu insertions is constant across exonic and nonexonic regions. In one case, a validated novel singleton Alu interrupts a protein-coding exon of FAM187B. This implies that exonic Alu insertions are generally deleterious and thus eliminated by natural selection, but not so quickly that they cannot be observed as extremely rare variants.
Affiliation(s)
- David J Witherspoon
- Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
|
331
|
Melià MJ, Kubota A, Ortolano S, Vílchez JJ, Gámez J, Tanji K, Bonilla E, Palenzuela L, Fernández-Cadenas I, Pristoupilová A, García-Arumí E, Andreu AL, Navarro C, Hirano M, Martí R. Limb-girdle muscular dystrophy 1F is caused by a microdeletion in the transportin 3 gene. Brain 2013; 136:1508-17. [PMID: 23543484 PMCID: PMC3634201 DOI: 10.1093/brain/awt074] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
In 2001, we reported linkage of an autosomal dominant form of limb-girdle muscular dystrophy, limb-girdle muscular dystrophy 1F, to chromosome 7q32.1-32.2, but the identity of the mutant gene was elusive. Here, using a whole genome sequencing strategy, we identified the causative mutation of limb-girdle muscular dystrophy 1F, a heterozygous single nucleotide deletion (c.2771del) in the termination codon of transportin 3 (TNPO3). This gene is situated within the chromosomal region linked to the disease and encodes a nuclear membrane protein belonging to the importin beta family. TNPO3 transports serine/arginine-rich proteins into the nucleus, and has been identified as a key factor in the HIV-import process into the nucleus. The mutation is predicted to generate a 15-amino acid extension of the C-terminus of the protein, segregates with the clinical phenotype, and is absent in genomic sequence databases and a set of >200 control alleles. In skeletal muscle of affected individuals, expression of the mutant messenger RNA and histological abnormalities of nuclei and TNPO3 indicate altered TNPO3 function. Our results demonstrate that the TNPO3 mutation is the cause of limb-girdle muscular dystrophy 1F, expand our knowledge of the molecular basis of muscular dystrophies and bolster the importance of defects of nuclear envelope proteins as causes of inherited myopathies.
Affiliation(s)
- Maria J Melià
- Research Group on Neuromuscular and Mitochondrial Disorders, Vall d'Hebron Institut de Recerca, VHIR, Universitat Autònoma de Barcelona, Passeig Vall d'Hebron, 119-129 08035 Barcelona, Spain
|
332
|
Quinn EM, Cormican P, Kenny EM, Hill M, Anney R, Gill M, Corvin AP, Morris DW. Development of strategies for SNP detection in RNA-seq data: application to lymphoblastoid cell lines and evaluation using 1000 Genomes data. PLoS One 2013; 8:e58815. [PMID: 23555596 PMCID: PMC3608647 DOI: 10.1371/journal.pone.0058815] [Citation(s) in RCA: 101] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2012] [Accepted: 02/07/2013] [Indexed: 11/24/2022] Open
Abstract
Next-generation RNA sequencing (RNA-seq) maps and analyzes transcriptomes and generates data on sequence variation in expressed genes. There are few reported studies on analysis strategies to maximize the yield of quality RNA-seq SNP data. We evaluated the performance of different SNP-calling methods following alignment to both genome and transcriptome by applying them to RNA-seq data from a HapMap lymphoblastoid cell line sample and comparing results with sequence variation data from 1000 Genomes. We determined that the best method to achieve high specificity and sensitivity, and greatest number of SNP calls, is to remove duplicate sequence reads after alignment to the genome and to call SNPs using SAMtools. The accuracy of SNP calls is dependent on sequence coverage available. In terms of specificity, 89% of RNA-seq SNPs calls were true variants where coverage is >10X. In terms of sensitivity, at >10X coverage 92% of all expected SNPs in expressed exons could be detected. Overall, the results indicate that RNA-seq SNP data are a very useful by-product of sequence-based transcriptome analysis. If RNA-seq is applied to disease tissue samples and assuming that genes carrying mutations relevant to disease biology are being expressed, a very high proportion of these mutations can be detected.
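The two accuracy figures quoted above (the fraction of RNA-seq calls that are true variants, and the fraction of expected SNPs that are detected, both at >10X coverage) can be computed as in the sketch below. It is a minimal illustration, not the authors' evaluation code: the function name `evaluate_calls` and the input layout (sets of positions plus a position-to-depth dictionary) are assumptions.

```python
def evaluate_calls(called, truth, coverage, min_cov=10):
    """Compare SNP calls with a truth set at sites with depth > min_cov.

    called   -- set of positions called as SNPs from RNA-seq
    truth    -- set of positions known to be variant (e.g. a reference panel)
    coverage -- dict mapping position -> read depth
    Returns (fraction_of_calls_true, sensitivity).
    """
    covered = {pos for pos, depth in coverage.items() if depth > min_cov}
    called_cov = called & covered   # calls made at well-covered sites
    truth_cov = truth & covered     # expected SNPs at well-covered sites
    true_pos = called_cov & truth_cov

    nan = float("nan")
    frac_true = len(true_pos) / len(called_cov) if called_cov else nan
    sensitivity = len(true_pos) / len(truth_cov) if truth_cov else nan
    return frac_true, sensitivity
```

Note that the quantity the abstract calls specificity (calls that are true variants) is what is often termed precision or positive predictive value; the sketch returns it under the neutral name `frac_true`.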
Affiliation(s)
- Emma M. Quinn
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Paul Cormican
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Elaine M. Kenny
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Matthew Hill
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Richard Anney
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Michael Gill
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Aiden P. Corvin
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
| | - Derek W. Morris
- TrinSeq and Neuropsychiatric Genetics Research Group, Department of Psychiatry and Institute of Molecular Medicine, Trinity College Dublin, Dublin, Ireland
|
333
|
Jensen TJ, Zwiefelhofer T, Tim RC, Džakula Ž, Kim SK, Mazloom AR, Zhu Z, Tynan J, Lu T, McLennan G, Palomaki GE, Canick JA, Oeth P, Deciu C, van den Boom D, Ehrich M. High-throughput massively parallel sequencing for fetal aneuploidy detection from maternal plasma. PLoS One 2013; 8:e57381. [PMID: 23483908 PMCID: PMC3590217 DOI: 10.1371/journal.pone.0057381] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2012] [Accepted: 01/21/2013] [Indexed: 01/04/2023] Open
Abstract
Background: Circulating cell-free (ccf) fetal DNA comprises 3-20% of all the cell-free DNA present in maternal plasma. Numerous research and clinical studies have described the analysis of ccf DNA using next generation sequencing for the detection of fetal aneuploidies with high sensitivity and specificity. We sought to extend the utility of this approach by assessing semi-automated library preparation, higher sample multiplexing during sequencing, and improved bioinformatic tools to enable a higher throughput, more efficient assay while maintaining or improving clinical performance. Methods: Whole blood (10 mL) was collected from pregnant female donors and plasma separated using centrifugation. Ccf DNA was extracted using column-based methods. Libraries were prepared using an optimized semi-automated library preparation method and sequenced on an Illumina HiSeq2000 sequencer in a 12-plex format. Z-scores were calculated for affected chromosomes using a robust method after normalization and genomic segment filtering. Classification was based upon a standard normal transformed cutoff value of z = 3 for chromosome 21 and z = 3.95 for chromosomes 18 and 13. Results: Two parallel assay development studies using a total of more than 1900 ccf DNA samples were performed to evaluate the technical feasibility of automating library preparation and increasing the sample multiplexing level. These processes were subsequently combined and a study of 1587 samples was completed to verify the stability of the process-optimized assay. Finally, an unblinded clinical evaluation of 1269 euploid and aneuploid samples utilizing this high-throughput assay coupled to improved bioinformatic procedures was performed. We were able to correctly detect all aneuploid cases with extremely low false positive rates of 0.09%, <0.01%, and 0.08% for trisomies 21, 18, and 13, respectively. Conclusions: These data suggest that the developed laboratory methods in concert with improved bioinformatic approaches enable higher sample throughput while maintaining high classification accuracy.
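The classification step described in the methods can be illustrated with a short Python sketch. The cutoffs (z = 3 for chromosome 21, z = 3.95 for chromosomes 18 and 13) come from the abstract; everything else is an assumption made for illustration, since the abstract does not specify the "robust method": here the z-score is computed from the median and the scaled median absolute deviation (MAD) of a euploid reference set, a sample is called positive when z is at or above the cutoff, and the names `robust_z` and `classify` are hypothetical.

```python
from statistics import median

# Cutoffs as stated in the abstract.
CUTOFFS = {"chr21": 3.0, "chr18": 3.95, "chr13": 3.95}


def robust_z(value, reference):
    """Median/MAD z-score of `value` against euploid reference fractions.

    The 1.4826 factor scales the MAD so that it estimates the standard
    deviation under normality. Choosing median/MAD is an assumption.
    """
    med = median(reference)
    mad = median([abs(x - med) for x in reference])
    if mad == 0:
        raise ValueError("reference set has zero MAD")
    return (value - med) / (1.4826 * mad)


def classify(chromosome, value, reference):
    """Return True when the sample is called trisomic for `chromosome`."""
    return robust_z(value, reference) >= CUTOFFS[chromosome]
```

A higher cutoff for chromosomes 18 and 13 makes a positive call harder to reach for a given deviation, so the same normalized fraction can be called positive for chromosome 21 yet negative for chromosome 18.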
Affiliation(s)
- Taylor J. Jensen
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Tricia Zwiefelhofer
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Roger C. Tim
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Željko Džakula
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Sung K. Kim
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Amin R. Mazloom
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Zhanyang Zhu
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - John Tynan
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Tim Lu
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Graham McLennan
- Research and Development, Sequenom Inc., San Diego, California, United States of America
| | - Glenn E. Palomaki
- Women and Infants Hospital, Alpert Medical School of Brown University, Providence, Rhode Island, United States of America
| | - Jacob A. Canick
- Women and Infants Hospital, Alpert Medical School of Brown University, Providence, Rhode Island, United States of America
| | - Paul Oeth
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Cosmin Deciu
- Research and Development, Sequenom Center for Molecular Medicine, San Diego, California, United States of America
| | - Dirk van den Boom
- Research and Development, Sequenom Inc., San Diego, California, United States of America
| | - Mathias Ehrich
- Research and Development, Sequenom Inc., San Diego, California, United States of America
|
334
|
Efficient and comprehensive representation of uniqueness for next-generation sequencing by minimum unique length analyses. PLoS One 2013; 8:e53822. [PMID: 23349747 PMCID: PMC3548888 DOI: 10.1371/journal.pone.0053822] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2012] [Accepted: 12/03/2012] [Indexed: 01/22/2023] Open
Abstract
As next generation sequencing technologies are getting more efficient and less expensive, RNA-Seq is becoming a widely used technique for transcriptome studies. Computational analysis of RNA-Seq data often starts with the mapping of millions of short reads back to the genome or transcriptome, a process in which some reads are found to map equally well to multiple genomic locations (multimapping reads). We have developed the Minimum Unique Length Tool (MULTo), a framework for efficient and comprehensive representation of mappability information, through identification of the shortest possible length required for each genomic coordinate to become unique in the genome and transcriptome. Using the minimum unique length information, we have compared different uniqueness compensation approaches for transcript expression level quantification and demonstrate that the best compensation is achieved by discarding multimapping reads and correctly adjusting gene model lengths. We have also explored uniqueness within specific regions of the mouse genome and enhancer mapping experiments. Finally, by making MULTo available to the community we hope to facilitate the use of uniqueness compensation in RNA-Seq analysis and to eliminate the need to make additional mappability files.
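The central quantity, the shortest length at which a sequence starting at a given coordinate becomes unique, can be shown with a brute-force Python sketch. It is not MULTo's algorithm (which is designed to be efficient on whole genomes) and it simplifies the problem: only one strand is considered, and the function name `minimum_unique_lengths` is hypothetical.

```python
from collections import Counter


def minimum_unique_lengths(seq, max_len=None):
    """For each start position, return the smallest k such that
    seq[i:i+k] occurs exactly once in seq, or None if no such k exists.

    Brute force, forward strand only; suitable for toy inputs.
    """
    n = len(seq)
    max_len = max_len or n
    result = [None] * n
    unresolved = set(range(n))
    for k in range(1, max_len + 1):
        # Count every k-mer in the sequence.
        counts = Counter(seq[i:i + k] for i in range(n - k + 1))
        for i in list(unresolved):
            if i + k <= n and counts[seq[i:i + k]] == 1:
                result[i] = k
                unresolved.discard(i)
        if not unresolved:
            break
    return result


print(minimum_unique_lengths("ACGTACGA"))
# [4, 3, 2, 1, 4, 3, 2, None]
```

In the toy sequence the final `A` never becomes unique because it cannot be extended; a read of length L starting at coordinate i is uniquely mappable in this simplified sense whenever the stored value at i is not None and is at most L.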
|
335
|
Endale Ahanda ML, Fritz ER, Estellé J, Hu ZL, Madsen O, Groenen MAM, Beraldi D, Kapetanovic R, Hume DA, Rowland RRR, Lunney JK, Rogel-Gaillard C, Reecy JM, Giuffra E. Prediction of altered 3'- UTR miRNA-binding sites from RNA-Seq data: the swine leukocyte antigen complex (SLA) as a model region. PLoS One 2012; 7:e48607. [PMID: 23139801 PMCID: PMC3490867 DOI: 10.1371/journal.pone.0048607] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2012] [Accepted: 09/27/2012] [Indexed: 01/09/2023] Open
Abstract
The SLA (swine leukocyte antigen, MHC: SLA) genes are the most important determinants of immune, infectious disease and vaccine response in pigs; several genetic associations with immunity and swine production traits have been reported. However, most of the current knowledge on SLA is limited to gene coding regions. MicroRNAs (miRNAs) are small molecules that post-transcriptionally regulate the expression of a large number of protein-coding genes in metazoans, and are suggested to play important roles in fine-tuning immune mechanisms and disease responses. Polymorphisms in either miRNAs or their gene targets may have a significant impact on gene expression by abolishing, weakening or creating miRNA target sites, possibly leading to phenotypic variation. We explored the impact of variants in the 3'-UTR miRNA target sites of genes within the whole SLA region. The combined predictions by TargetScan, PACMIT and TargetSpy, based on different biological parameters, empowered the identification of miRNA target sites and the discovery of polymorphic miRNA target sites (poly-miRTSs). Predictions for three SLA genes characterized by a different range of sequence variation provided proof of principle for the analysis of poly-miRTSs from a total of 144 M RNA-Seq reads collected from different porcine tissues. Twenty-four novel SNPs were predicted to affect miRNA-binding sites in 19 genes of the SLA region. Seven of these genes (SLA-1, SLA-6, SLA-DQA, SLA-DQB1, SLA-DOA, SLA-DOB and TAP1) are linked to antigen processing and presentation functions, which is reminiscent of associations with disease traits reported for altered miRNA binding to MHC genes in humans. An inverse correlation in expression levels was demonstrated between miRNAs and co-expressed SLA targets by exploiting a published dataset (RNA-Seq and small RNA-Seq) of three porcine tissues. Our results support the resource value of RNA-Seq collections to identify SNPs that may lead to altered miRNA regulation patterns.
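How a single SNP can abolish or create a miRNA target site may be easier to see with a toy Python sketch. It does not reproduce TargetScan, PACMIT or TargetSpy; it only tests for a plain 7-mer match to miRNA nucleotides 2-8 (the seed), and the function names `seed_site` and `snp_effect` are hypothetical.

```python
# RNA complement table (A<->U, C<->G).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna):
    """Return the 7-mer in the mRNA that pairs with miRNA nucleotides 2-8."""
    seed = mirna[1:8]
    return seed.translate(_COMPLEMENT)[::-1]  # reverse complement


def snp_effect(utr, pos, alt, mirna):
    """Classify a SNP at 0-based `pos` in a 3'-UTR (RNA alphabet).

    Returns 'abolished', 'created' or 'unchanged', looking only at
    7-mer seed matches that overlap the SNP position.
    """
    site = seed_site(mirna)
    mutated = utr[:pos] + alt + utr[pos + 1:]
    # Any 7-mer overlapping `pos` lies inside this window.
    lo, hi = max(0, pos - 6), pos + 7
    before = site in utr[lo:hi]
    after = site in mutated[lo:hi]
    if before and not after:
        return "abolished"
    if after and not before:
        return "created"
    return "unchanged"
```

Real predictors weigh additional features such as site context, accessibility and conservation, so a seed-only check like this one would over-call sites and cannot detect weakened ones.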
Affiliation(s)
- Marie-Laure Endale Ahanda
- INRA, UMR 1313 de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
- CEA, DSV, IRCM, SREIT, Laboratoire de Radiobiologie et Etude du Génome, Domaine de Vilvert, Jouy-en-Josas, France
- AgroParisTech, Laboratoire de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
| | - Eric R. Fritz
- Department of Animal Science and Center for Integrated Animal Genomics, Iowa State University, Ames, Iowa, United States of America
| | - Jordi Estellé
- INRA, UMR 1313 de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
- CEA, DSV, IRCM, SREIT, Laboratoire de Radiobiologie et Etude du Génome, Domaine de Vilvert, Jouy-en-Josas, France
- AgroParisTech, Laboratoire de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
| | - Zhi-Liang Hu
- Department of Animal Science and Center for Integrated Animal Genomics, Iowa State University, Ames, Iowa, United States of America
| | - Ole Madsen
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, The Netherlands
| | - Martien A. M. Groenen
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, The Netherlands
| | - Dario Beraldi
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, United Kingdom
| | - Ronan Kapetanovic
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, United Kingdom
| | - David A. Hume
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, United Kingdom
| | - Robert R. R. Rowland
- Department of Diagnostic Medicine and Pathobiology, Kansas State University, Manhattan, Kansas, United States of America
| | - Joan K. Lunney
- Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, Maryland, United States of America
| | - Claire Rogel-Gaillard
- INRA, UMR 1313 de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
- CEA, DSV, IRCM, SREIT, Laboratoire de Radiobiologie et Etude du Génome, Domaine de Vilvert, Jouy-en-Josas, France
- AgroParisTech, Laboratoire de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
| | - James M. Reecy
- Department of Animal Science and Center for Integrated Animal Genomics, Iowa State University, Ames, Iowa, United States of America
| | - Elisabetta Giuffra
- INRA, UMR 1313 de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
- CEA, DSV, IRCM, SREIT, Laboratoire de Radiobiologie et Etude du Génome, Domaine de Vilvert, Jouy-en-Josas, France
- AgroParisTech, Laboratoire de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, Jouy-en-Josas, France
|