351
Modeling read counts for CNV detection in exome sequencing data. Stat Appl Genet Mol Biol 2011; 10(1). [PMID: 23089826 DOI: 10.2202/1544-6115.1732] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 12/12/2022]
Abstract
Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
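The hidden Markov model idea described in this abstract can be illustrated with a minimal sketch. It is not the exomeCopy implementation: the function names, the Poisson emission (chosen here only for brevity), the fixed `stay` probability, and the example counts are all assumptions made for illustration, and positional covariates such as GC-content are left out.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_copy_number(counts, background, states=(0, 1, 2, 3, 4),
                        normal=2, stay=0.99, eps=0.01):
    """Most likely copy-number path for a series of target read counts.

    counts:     observed read count per target (hypothetical input)
    background: expected count per target at normal copy number,
                e.g. derived from a control set (hypothetical input)
    """
    n = len(states)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]

    def emit(t, s):
        # Expected depth scales with copy number; eps keeps the mean
        # positive for the zero-copy state.
        lam = background[t] * max(states[s], eps) / normal
        return poisson_logpmf(counts[t], lam)

    # Uniform prior over the starting state.
    score = [math.log(1.0 / n) + emit(0, s) for s in range(n)]
    back = []
    for t in range(1, len(counts)):
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + emit(t, j))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    s = max(range(n), key=lambda i: score[i])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return [states[s] for s in path]


background = [100] * 12
counts = [100, 98, 103, 101, 52, 49, 47, 51, 99, 102, 97, 100]
path = viterbi_copy_number(counts, background)
```

With these made-up counts, the four targets at roughly half the background depth are assigned one copy and the rest two copies. A model closer to the abstract would replace the Poisson emission with an overdispersed distribution and let the expected depth depend on covariates.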
352
Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics 2011; 28:40-7. [PMID: 22039209 DOI: 10.1093/bioinformatics/btr593] [Citation(s) in RCA: 120] [Impact Index Per Article: 9.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION Comparison of read depths from next-generation sequencing between cancer and normal cells makes the estimation of copy number alteration (CNA) possible, even at very low coverage. However, estimating CNA from patients' tumour samples poses considerable challenges due to infiltration with normal cells and aneuploid cancer genomes. Here we provide a method that corrects contamination with normal cells and adjusts for genomes of different sizes so that the actual copy number of each region can be estimated. RESULTS The procedure consists of several steps. First, we identify the multi-modality of the distribution of smoothed ratios. Then we use the estimates of the mean (modes) to identify underlying ploidy and the contamination level, and finally we perform the correction. The results indicate that the method works properly to estimate genomic regions with gains and losses in a range of simulated data as well as in two datasets from lung cancer patients. It also proves a powerful tool when analysing publicly available data from two cell lines (HCC1143 and COLO829). AVAILABILITY An R package, called CNAnorm, is available at http://www.precancer.leeds.ac.uk/cnanorm or from Bioconductor. CONTACT a.gusnanto@leeds.ac.uk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
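The correction for normal-cell contamination and genome size can be illustrated with a simple mixture model. This is a sketch under stated assumptions, not the CNAnorm procedure: it assumes the tumour fraction and mean tumour ploidy are already known (the abstract estimates them from the modes of the smoothed ratios), that normal cells carry two copies everywhere, and that the observed ratio is normalised to the genome-wide mean of the mixture. All function and parameter names are hypothetical.

```python
def expected_ratio(copies, tumour_fraction, tumour_ploidy=2.0, normal_copies=2.0):
    """Tumour/normal depth ratio expected for a region with `copies` copies
    in the tumour cells, given contamination by normal cells."""
    mix_region = tumour_fraction * copies + (1.0 - tumour_fraction) * normal_copies
    mix_mean = tumour_fraction * tumour_ploidy + (1.0 - tumour_fraction) * normal_copies
    return mix_region / mix_mean


def corrected_copy_number(ratio, tumour_fraction, tumour_ploidy=2.0, normal_copies=2.0):
    """Invert expected_ratio: recover the tumour copy number from an observed ratio."""
    if not 0.0 < tumour_fraction <= 1.0:
        raise ValueError("tumour_fraction must be in (0, 1]")
    mix_mean = tumour_fraction * tumour_ploidy + (1.0 - tumour_fraction) * normal_copies
    return (ratio * mix_mean - (1.0 - tumour_fraction) * normal_copies) / tumour_fraction


# A single-copy loss in a 50% pure, diploid tumour gives a ratio of 0.75,
# which the correction maps back to one copy.
observed = expected_ratio(1, tumour_fraction=0.5)
recovered = corrected_copy_number(observed, tumour_fraction=0.5)
```

The example shows why uncorrected ratios understate the size of a change: with half the cells normal, a one-copy loss moves the ratio only from 1.0 to 0.75.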
Affiliation(s)
- Arief Gusnanto
  - Department of Statistics, University of Leeds, Leeds LS2 9JT, UK
353
Szpara ML, Tafuri YR, Parsons L, Shamim SR, Verstrepen KJ, Legendre M, Enquist LW. A wide extent of inter-strain diversity in virulent and vaccine strains of alphaherpesviruses. PLoS Pathog 2011; 7:e1002282. [PMID: 22022263 PMCID: PMC3192842 DOI: 10.1371/journal.ppat.1002282] [Citation(s) in RCA: 122] [Impact Index Per Article: 9.4] [Received: 04/20/2011] [Accepted: 08/10/2011] [Indexed: 12/17/2022]
Abstract
Alphaherpesviruses are widespread in the human population, and include herpes simplex virus 1 (HSV-1) and 2, and varicella zoster virus (VZV). These viral pathogens cause epithelial lesions, and then infect the nervous system to cause lifelong latency, reactivation, and spread. A related veterinary herpesvirus, pseudorabies (PRV), causes similar disease in livestock that results in significant economic losses. Vaccines developed for VZV and PRV serve as useful models for the development of an HSV-1 vaccine. We present full genome sequence comparisons of the PRV vaccine strain Bartha, and two virulent PRV isolates, Kaplan and Becker. These genome sequences were determined by high-throughput sequencing and assembly, and present new insights into the attenuation of a mammalian alphaherpesvirus vaccine strain. We find many previously unknown coding differences between PRV Bartha and the virulent strains, including changes to the fusion proteins gH and gB, and over forty other viral proteins. Inter-strain variation in PRV protein sequences is much closer to levels previously observed for HSV-1 than for the highly stable VZV proteome. Almost 20% of the PRV genome contains tandem short sequence repeats (SSRs), a class of nucleic acid motifs whose length variation has been associated with changes in DNA binding site efficiency, transcriptional regulation, and protein interactions. We find SSRs throughout the herpesvirus family, and provide the first global characterization of SSRs in viruses, both within and between strains. We find SSR length variation between different isolates of PRV and HSV-1, which may provide a new mechanism for phenotypic variation between strains. Finally, we detected a small number of polymorphic bases within each plaque-purified PRV strain, and we characterize the effect of passage and plaque-purification on these polymorphisms. These data add to growing evidence that even plaque-purified stocks of stable DNA viruses exhibit limited sequence heterogeneity, which likely seeds future strain evolution.
Alphaherpesviruses such as herpes simplex virus (HSV) are ubiquitous in the human population. HSV causes oral and genital lesions, and has co-morbidities in acquisition and spread of human immunodeficiency virus (HIV). The lack of a vaccine for HSV hinders medical progress for both of these infections. A related veterinary alphaherpesvirus, pseudorabies virus (PRV), has long served as a model for HSV vaccine development, because of their similar pathogenesis, neuronal spread, and infectious cycle. We present here the first full genome characterization of a live PRV vaccine strain, Bartha, and reveal a spectrum of unique mutations that are absent from two divergent wild-type PRV strains. These mutations can now be examined individually for their contribution to vaccine strain attenuation and for potential use in HSV vaccine development. These inter-strain comparisons also revealed an abundance of short repetitive elements in the PRV genome, a pattern which is repeated in other herpesvirus genomes and even the unrelated Mimivirus. We provide the first global characterization of repeats in viruses, comparing both their presence and their variation among different viral strains and species. Repetitive elements such as these have been shown to serve as hotspots of variation between individuals or strains of other organisms, generating adaptations or even disease states through changes in length of DNA-binding sites, protein folding motifs, and other structural elements. These data suggest for the first time that similar mechanisms could be widely distributed in viral biology as well.
Affiliation(s)
- Moriah L. Szpara
  - Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
  - Princeton Neuroscience Institute, Princeton University, Princeton, New Jersey, United States of America
- Yolanda R. Tafuri
  - Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lance Parsons
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- S. Rafi Shamim
  - Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Kevin J. Verstrepen
  - VIB lab for Systems Biology and CMPG Lab for Genetics and Genomics, KULeuven, Gaston Geenslaan 1, Leuven, Belgium
- Matthieu Legendre
  - Structural & Genomic Information Laboratory (CNRS, UPR2589), Mediterranean Institute of Microbiology, Aix-Marseille Université, Marseille, France
- L. W. Enquist
  - Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
  - Princeton Neuroscience Institute, Princeton University, Princeton, New Jersey, United States of America
354
Onishi-Seebacher M, Korbel JO. Challenges in studying genomic structural variant formation mechanisms: The short-read dilemma and beyond. Bioessays 2011; 33:840-50. [DOI: 10.1002/bies.201100075] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Indexed: 01/14/2023]
355
Abstract
Copy number variants (CNVs) play an important role in human disease and population diversity. Advancements in technology have allowed for the analysis of CNVs in thousands of individuals with disease in addition to thousands of controls. These studies have identified rare CNVs associated with neuropsychiatric diseases such as autism, schizophrenia, and intellectual disability. In addition, copy number polymorphisms (CNPs) are present at higher frequencies in the population, show high diversity in copy number, sequence, and structure, and have been associated with multiple phenotypes, primarily related to immune or environmental response. However, the landscape of copy number variation still remains largely unexplored, especially for smaller CNVs and those embedded within complex regions of the human genome. An integrated approach including characterization of single nucleotide variants and CNVs in a large number of individuals with disease and normal genomes holds the promise of thoroughly elucidating the genetic basis of human disease and diversity.
Affiliation(s)
- Santhosh Girirajan
  - Department of Genome Sciences and Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA.
356
Stewart C, Kural D, Strömberg MP, Walker JA, Konkel MK, Stütz AM, Urban AE, Grubert F, Lam HYK, Lee WP, Busby M, Indap AR, Garrison E, Huff C, Xing J, Snyder MP, Jorde LB, Batzer MA, Korbel JO, Marth GT. A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet 2011; 7:e1002236. [PMID: 21876680 PMCID: PMC3158055 DOI: 10.1371/journal.pgen.1002236] [Citation(s) in RCA: 229] [Impact Index Per Article: 17.6] [Received: 12/21/2010] [Accepted: 06/24/2011] [Indexed: 11/18/2022]
Abstract
As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
Affiliation(s)
- Chip Stewart
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Deniz Kural
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Michael P. Strömberg
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Jerilyn A. Walker
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana, United States of America
- Miriam K. Konkel
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana, United States of America
- Adrian M. Stütz
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Alexander E. Urban
  - Department of Genetics, Stanford University, Stanford, California, United States of America
- Fabian Grubert
  - Department of Genetics, Stanford University, Stanford, California, United States of America
- Hugo Y. K. Lam
  - Department of Genetics, Stanford University, Stanford, California, United States of America
- Wan-Ping Lee
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Michele Busby
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Amit R. Indap
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Erik Garrison
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Chad Huff
  - Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, United States of America
- Jinchuan Xing
  - Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, United States of America
- Michael P. Snyder
  - Department of Genetics, Stanford University, Stanford, California, United States of America
- Lynn B. Jorde
  - Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, United States of America
- Mark A. Batzer
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana, United States of America
- Jan O. Korbel
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Gabor T. Marth
  - Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
357
Robinson T, Campino SG, Auburn S, Assefa SA, Polley SD, Manske M, MacInnis B, Rockett KA, Maslen GL, Sanders M, Quail MA, Chiodini PL, Kwiatkowski DP, Clark TG, Sutherland CJ. Drug-resistant genotypes and multi-clonality in Plasmodium falciparum analysed by direct genome sequencing from peripheral blood of malaria patients. PLoS One 2011; 6:e23204. [PMID: 21853089 PMCID: PMC3154926 DOI: 10.1371/journal.pone.0023204] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Received: 04/01/2011] [Accepted: 07/08/2011] [Indexed: 11/19/2022]
Abstract
Naturally acquired blood-stage infections of the malaria parasite Plasmodium falciparum typically harbour multiple haploid clones. The apparent number of clones observed in any single infection depends on the diversity of the polymorphic markers used for the analysis, and the relative abundance of rare clones, which frequently fail to be detected among PCR products derived from numerically dominant clones. However, minority clones are of clinical interest as they may harbour genes conferring drug resistance, leading to enhanced survival after treatment and the possibility of subsequent therapeutic failure. We deployed new generation sequencing to derive genome data for five non-propagated parasite isolates taken directly from 4 different patients treated for clinical malaria in a UK hospital. Analysis of depth of coverage and length of sequence intervals between paired reads identified both previously described and novel gene deletions and amplifications. Full-length sequence data was extracted for 6 loci considered to be under selection by antimalarial drugs, and both known and previously unknown amino acid substitutions were identified. Full mitochondrial genomes were extracted from the sequencing data for each isolate, and these are compared against a panel of polymorphic sites derived from published or unpublished but publicly available data. Finally, genome-wide analysis of clone multiplicity was performed, and the number of infecting parasite clones estimated for each isolate. Each patient harboured at least 3 clones of P. falciparum by this analysis, consistent with results obtained with conventional PCR analysis of polymorphic merozoite antigen loci. We conclude that genome sequencing of peripheral blood P. falciparum taken directly from malaria patients provides high quality data useful for drug resistance studies, genomic structural analyses and population genetics, and also robustly represents clonal multiplicity.
Affiliation(s)
- Timothy Robinson
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Sarah Auburn
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
  - Global Health Division, Menzies School of Health Research, Charles Darwin University, Darwin, Australia
- Spencer D. Polley
  - Department of Clinical Parasitology, Hospital for Tropical Diseases, London, United Kingdom
- Magnus Manske
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Bronwyn MacInnis
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Kirk A. Rockett
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Mandy Sanders
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Peter L. Chiodini
  - Department of Clinical Parasitology, Hospital for Tropical Diseases, London, United Kingdom
  - Faculties of Infectious and Tropical Diseases and Epidemiology and Population Health, London School of Hygiene & Tropical Medicine, London, United Kingdom
- Dominic P. Kwiatkowski
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Taane G. Clark
  - Faculties of Infectious and Tropical Diseases and Epidemiology and Population Health, London School of Hygiene & Tropical Medicine, London, United Kingdom
- Colin J. Sutherland
  - Department of Clinical Parasitology, Hospital for Tropical Diseases, London, United Kingdom
  - Faculties of Infectious and Tropical Diseases and Epidemiology and Population Health, London School of Hygiene & Tropical Medicine, London, United Kingdom
358
Sathirapongsasuti JF, Lee H, Horst BAJ, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011; 27:2648-54. [PMID: 21828086 DOI: 10.1093/bioinformatics/btr462] [Citation(s) in RCA: 300] [Impact Index Per Article: 23.1] [Indexed: 02/06/2023]
Abstract
MOTIVATION The ability to detect copy-number variation (CNV) and loss of heterozygosity (LOH) from exome sequencing data extends the utility of this powerful approach that has mainly been used for point or small insertion/deletion detection. RESULTS We present ExomeCNV, a statistical method to detect CNV and LOH using depth-of-coverage and B-allele frequencies, from mapped short sequence reads, and we assess both the method's power and the effects of confounding variables. We apply our method to a cancer exome resequencing dataset. As expected, accuracy and resolution are dependent on depth-of-coverage and capture probe design. AVAILABILITY CRAN package 'ExomeCNV'. CONTACT fsathira@fas.harvard.edu; snelson@ucla.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
359
Shen Y, Gu Y, Pe'er I. A hidden Markov model for copy number variant prediction from whole genome resequencing data. BMC Bioinformatics 2011; 12 Suppl 6:S4. [PMID: 21989326 PMCID: PMC3194192 DOI: 10.1186/1471-2105-12-s6-s4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 12/24/2022]
Abstract
Motivation Copy Number Variants (CNVs) are important genetic factors for studying human diseases. While high-throughput whole genome re-sequencing provides multiple lines of evidence for detecting CNVs, computational algorithms need to be tailored for different types or sizes of CNVs under different experimental designs. Results To achieve optimal power and resolution of detecting CNVs at low depth of coverage, we implemented a Hidden Markov Model that integrates both depth of coverage and mate-pair relationship. The novelty of our algorithm is that we infer the likelihood of carrying a deletion jointly from multiple mate pairs in a region, without requiring any single mate pair to be an obvious outlier. By integrating all useful information in a comprehensive model, our method is able to detect medium-size deletions (200-2000bp) at low depth (<10× per sample). We applied the method to simulated data and demonstrate that the power of detecting medium-size deletions is close to theoretical values. Availability A program implemented in Java, Zinfandel, is available at http://www.cs.columbia.edu/~itsik/zinfandel/
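The point about pooling evidence across mate pairs can be made concrete with a small likelihood-ratio sketch. This is not Zinfandel's model: it assumes mapped mate-pair spans are Gaussian around the library mean, that a deletion of a given size shifts that mean by the deletion size, and that pairs are independent. The function names and the numbers are hypothetical.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log-density of a normal distribution with mean mu and s.d. sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def deletion_log_lr(spans, mu, sigma, deletion_size):
    """Summed log-likelihood ratio (deletion vs. no deletion) over all
    mate pairs whose mapped span covers the candidate region."""
    return sum(
        normal_logpdf(s, mu + deletion_size, sigma) - normal_logpdf(s, mu, sigma)
        for s in spans
    )


# Library with mean span 400 bp and s.d. 150 bp; candidate 300 bp deletion.
mu, sigma, deletion_size = 400.0, 150.0, 300.0
spans = [650.0, 720.0, 690.0, 740.0, 660.0]

# No single pair is beyond three standard deviations of the library mean...
no_outlier = all(abs(s - mu) < 3.0 * sigma for s in spans)
# ...yet together the pairs favour the deletion hypothesis.
joint_evidence = deletion_log_lr(spans, mu, sigma, deletion_size)
```

In this toy example each pair would pass a per-pair outlier filter, while the summed log-likelihood ratio is clearly positive, which is the behaviour the abstract attributes to joint inference.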
Affiliation(s)
- Yufeng Shen
  - Department of Computer Science, Columbia University, New York, NY 10027, USA.
360
Inferring haplotypes of copy number variations from high-throughput data with uncertainty. G3: Genes, Genomes, Genetics 2011; 1:35-42. [PMID: 22384316 PMCID: PMC3276117 DOI: 10.1534/g3.111.000174] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/31/2010] [Accepted: 03/14/2011] [Indexed: 11/18/2022]
Abstract
Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals' diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1-2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12-18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
361
Abstract
Advances in whole genome amplification and next-generation sequencing methods have enabled genomic analyses of single cells, and these techniques are now beginning to be used to detect genomic lesions in individual cancer cells. Previous approaches have been unable to resolve genomic differences in complex mixtures of cells, such as heterogeneous tumors, despite the importance of characterizing such tumors for cancer treatment. Sequencing of single cells is likely to improve several aspects of medicine, including the early detection of rare tumor cells, monitoring of circulating tumor cells (CTCs), measuring intratumor heterogeneity, and guiding chemotherapy. In this review we discuss the challenges and technical aspects of single-cell sequencing, with a strong focus on genomic copy number, and discuss how this information can be used to diagnose and treat cancer patients.
362
Ritz A, Paris PL, Ittmann MM, Collins C, Raphael BJ. Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics 2011; 12:114. [PMID: 21510904 PMCID: PMC3112242 DOI: 10.1186/1471-2105-12-114] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 08/30/2010] [Accepted: 04/21/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Copy number variants (CNVs), including deletions, amplifications, and other rearrangements, are common in human and cancer genomes. Copy number data from array comparative genome hybridization (aCGH) and next-generation DNA sequencing are widely used to measure copy number variants. Comparison of copy number data from multiple individuals reveals recurrent variants. Typically, the interior of a recurrent CNV is examined for genes or other loci associated with a phenotype. However, in some cases, such as gene truncations and fusion genes, the target of the variant lies at the boundary of the variant. RESULTS We introduce Neighborhood Breakpoint Conservation (NBC), an algorithm for identifying rearrangement breakpoints that are highly conserved at the same locus in multiple individuals. NBC detects recurrent breakpoints at varying levels of resolution, including breakpoints whose location is exactly conserved and breakpoints whose location varies within a gene. NBC also identifies pairs of recurrent breakpoints such as those that result from fusion genes. We apply NBC to aCGH data from 36 primary prostate tumors and identify 12 novel rearrangements, one of which is the well-known TMPRSS2-ERG fusion gene. We also apply NBC to 227 glioblastoma tumors and predict 93 novel rearrangements which we further classify as gene truncations, germline structural variants, and fusion genes. A number of these variants involve the protein phosphatase PTPN12, suggesting that deregulation of PTPN12, via a variety of rearrangements, is common in glioblastoma. CONCLUSIONS We demonstrate that NBC is useful for detection of recurrent breakpoints resulting from copy number variants or other structural variants, and in particular identifies recurrent breakpoints that result in gene truncations or fusion genes. Software is available at http://cs.brown.edu/people/braphael/software.html.
Affiliation(s)
- Anna Ritz
  - Department of Computer Science, Brown University, Providence, RI, USA.
363
Nord AS, Lee M, King MC, Walsh T. Accurate and exact CNV identification from targeted high-throughput sequence data. BMC Genomics 2011; 12:184. [PMID: 21486468 PMCID: PMC3088570 DOI: 10.1186/1471-2164-12-184] [Citation(s) in RCA: 156] [Impact Index Per Article: 12.0] [Received: 10/19/2010] [Accepted: 04/12/2011] [Indexed: 11/23/2022]
Abstract
Background Massively parallel sequencing of barcoded DNA samples significantly increases screening efficiency for clinically important genes. Short read aligners are well suited to single nucleotide and indel detection. However, methods for CNV detection from targeted enrichment are lacking. We present a method combining coverage with map information for the identification of deletions and duplications in targeted sequence data. Results Sequencing data is first scanned for gains and losses using a comparison of normalized coverage data between samples. CNV calls are confirmed by testing for a signature of sequences that span the CNV breakpoint. With our method, CNVs can be identified regardless of whether breakpoints are within regions targeted for sequencing. For CNVs where at least one breakpoint is within targeted sequence, exact CNV breakpoints can be identified. In a test data set of 96 subjects sequenced across ~1 Mb genomic sequence using multiplexing technology, our method detected mutations as small as 31 bp, predicted quantitative copy count, and had a low false-positive rate. Conclusions Application of this method allows for identification of gains and losses in targeted sequence data, providing comprehensive mutation screening when combined with a short read aligner.
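The first step described in this abstract, scanning for gains and losses by comparing normalized coverage between samples, can be sketched as follows. This is an illustration only, not the authors' pipeline: normalising by total depth, using the median of the other samples as the baseline, and the 0.7 and 1.3 thresholds are all assumptions, and the breakpoint-confirmation step is not shown.

```python
from statistics import median


def normalise(depths):
    """Scale per-target depths so they sum to 1 within a sample."""
    total = sum(depths)
    return [d / total for d in depths]


def coverage_ratios(sample, references):
    """Ratio of a sample's normalised depth to the median normalised
    depth of the reference samples, per target."""
    s = normalise(sample)
    refs = [normalise(r) for r in references]
    ratios = []
    for i, value in enumerate(s):
        baseline = median(r[i] for r in refs)
        ratios.append(value / baseline if baseline > 0 else float("nan"))
    return ratios


def call_gain_loss(ratios, loss=0.7, gain=1.3):
    """Label each target as 'loss', 'gain', or 'neutral' (hypothetical thresholds)."""
    calls = []
    for r in ratios:
        if r < loss:
            calls.append("loss")
        elif r > gain:
            calls.append("gain")
        else:
            calls.append("neutral")
    return calls


references = [[100, 100, 100, 100]] * 3
sample = [100, 50, 100, 100]  # second target at half depth
calls = call_gain_loss(coverage_ratios(sample, references))
```

Here the second target has a ratio of about 0.57 and is flagged as a candidate loss. Note that total-depth normalisation slightly inflates the other targets (ratio about 1.14), which is one reason a real pipeline would use a more robust normalisation.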
Affiliation(s)
- Alex S Nord
  - Department of Genome Sciences, University of Washington, Seattle, 98195-7720, USA.
364
Polymorphic family of injected pseudokinases is paramount in Toxoplasma virulence. Proc Natl Acad Sci U S A 2011; 108:9625-30. [PMID: 21436047 DOI: 10.1073/pnas.1015980108] [Citation(s) in RCA: 196] [Impact Index Per Article: 15.1] [Indexed: 11/18/2022]
Abstract
Toxoplasma gondii, an obligate intracellular parasite of the phylum Apicomplexa, has the unusual ability to infect virtually any warm-blooded animal. It is an extraordinarily successful parasite, infecting an estimated 30% of humans worldwide. The outcome of Toxoplasma infection is highly dependent on allelic differences in the large number of effectors that the parasite secretes into the host cell. Here, we show that the largest determinant of the virulence difference between two of the most common strains of Toxoplasma is the ROP5 locus. This is an unusual segment of the Toxoplasma genome consisting of a family of 4-10 tandem, highly divergent genes encoding pseudokinases that are injected directly into host cells. Given their hypothesized catalytic inactivity, it is striking that deletion of the ROP5 cluster in a highly virulent strain caused a complete loss of virulence, showing that ROP5 proteins are, in fact, indispensable for Toxoplasma to cause disease in mice. We find that copy number at this locus varies among the three major Toxoplasma lineages and that extensive polymorphism is clustered into hotspots within the ROP5 pseudokinase domain. We propose that the ROP5 locus represents an unusual evolutionary strategy for sampling of sequence space in which the gene encoding an important enzyme has been (i) catalytically inactivated, (ii) expanded in number, and (iii) subject to strong positive selection. Such a strategy likely contributes to Toxoplasma's successful adaptation to a wide host range and has resulted in dramatic differences in virulence.
365
Tumour evolution inferred by single-cell sequencing. Nature 2011; 472:90-4. [PMID: 21399628 DOI: 10.1038/nature09807] [Citation(s) in RCA: 1866] [Impact Index Per Article: 143.5] [Received: 05/25/2010] [Accepted: 01/07/2011] [Indexed: 12/13/2022]
Abstract
Genomic analysis provides insights into the role of copy number variation in disease, but most methods are not designed to resolve mixed populations of cells. In tumours, where genetic heterogeneity is common, very important information may be lost that would be useful for reconstructing evolutionary history. Here we show that with flow-sorted nuclei, whole genome amplification and next generation sequencing we can accurately quantify genomic copy number within an individual nucleus. We apply single-nucleus sequencing to investigate tumour population structure and evolution in two human breast cancer cases. Analysis of 100 single cells from a polygenomic tumour revealed three distinct clonal subpopulations that probably represent sequential clonal expansions. Additional analysis of 100 single cells from a monogenomic primary tumour and its liver metastasis indicated that a single clonal expansion formed the primary tumour and seeded the metastasis. In both primary tumours, we also identified an unexpectedly abundant subpopulation of genetically diverse 'pseudodiploid' cells that do not travel to the metastatic site. In contrast to gradual models of tumour progression, our data indicate that tumours grow by punctuated clonal expansions with few persistent intermediates.
|
366
|
Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet 2011; 12:363-76. [PMID: 21358748 DOI: 10.1038/nrg2958] [Citation(s) in RCA: 963] [Impact Index Per Article: 74.1] [Indexed: 12/13/2022]
Abstract
Comparisons of human genomes show that more base pairs are altered as a result of structural variation - including copy number variation - than as a result of point mutations. Here we review advances and challenges in the discovery and genotyping of structural variation. The recent application of massively parallel sequencing methods has complemented microarray-based methods and has led to an exponential increase in the discovery of smaller structural-variation events. Some global discovery biases remain, but the integration of experimental and computational approaches is proving fruitful for accurate characterization of the copy, content and structure of variable regions. We argue that the long-term goal should be routine, cost-effective and high quality de novo assembly of human genomes to comprehensively assess all classes of structural variation.
Affiliation(s)
- Can Alkan
- Department of Genome Sciences, University of Washington School of Medicine, Foege S413C, 3720 15th Ave NE, Seattle, Washington, USA
|
367
|
Mapping copy number variation by population-scale genome sequencing. Nature 2011; 470:59-65. [PMID: 21293372 PMCID: PMC3077050 DOI: 10.1038/nature09708] [Citation(s) in RCA: 821] [Impact Index Per Article: 63.2] [Received: 08/19/2010] [Accepted: 11/26/2010] [Indexed: 11/08/2022]
Abstract
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
|
368
|
Revisiting Mendelian disorders through exome sequencing. Hum Genet 2011; 129:351-70. [PMID: 21331778 DOI: 10.1007/s00439-011-0964-2] [Citation(s) in RCA: 147] [Impact Index Per Article: 11.3] [Received: 11/29/2010] [Accepted: 02/03/2011] [Indexed: 12/25/2022]
Abstract
Over the past several years, more focus has been placed on dissecting the genetic basis of complex diseases and traits through genome-wide association studies. In contrast, Mendelian disorders have received little attention, mainly due to the lack of newer and more powerful methods to study these disorders. Linkage studies have previously been the main tool to elucidate the genetics of Mendelian disorders; however, extremely rare disorders or sporadic cases caused by de novo variants are not amenable to this study design. Exome sequencing has now become technically feasible and more cost-effective due to the recent advances in high-throughput sequence capture methods and next-generation sequencing technologies, which have offered new opportunities for Mendelian disorder research. Exome sequencing has been swiftly applied to the discovery of new causal variants and candidate genes for a number of Mendelian disorders such as Kabuki syndrome, Miller syndrome and Fowler syndrome. In addition, de novo variants were also identified for sporadic cases, which would not have been possible without exome sequencing. Although exome sequencing has proven to be a promising approach to study Mendelian disorders, several shortcomings of this method must be noted, such as the inability to capture regulatory or evolutionarily conserved sequences in non-coding regions and the incomplete capture of all exons.
|
369
|
Magi A, Benelli M, Yoon S, Roviello F, Torricelli F. Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic Acids Res 2011; 39:e65. [PMID: 21321017 PMCID: PMC3105418 DOI: 10.1093/nar/gkr068] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.2] [Indexed: 12/02/2022] Open
Abstract
The discovery of genomic structural variants (SVs), such as copy number variants (CNVs), is essential to understand genetic variation of human populations and complex diseases. Over recent years, the advent of new high-throughput sequencing (HTS) platforms has opened many opportunities for SV discovery, and a very promising approach consists in measuring the depth of coverage (DOC) of reads aligned to the human reference genome. At present, few computational methods have been developed for the analysis of DOC data, and all of these methods can analyse only one sample at a time. For these reasons, we developed a novel algorithm (JointSLM) that detects common CNVs among individuals by analysing DOC data from multiple samples simultaneously. We test JointSLM performance on synthetic and real data and we show its unprecedented resolution, which enables the detection of recurrent CNV regions as small as 500 bp in size. When we apply JointSLM to analyse chromosome one of eight genomes with different ancestry, we identify 3000 regions with recurrent CNVs of different frequency and size: hierarchical clustering on these regions segregates the eight individuals into two groups that reflect their ancestry, demonstrating the potential utility of JointSLM for population genetics studies.
Affiliation(s)
- Alberto Magi
- Laboratory Department, Diagnostic Genetic Unit, Careggi Hospital, Florence 5014, Italy.
|
370
|
Handsaker RE, Korn JM, Nemesh J, McCarroll SA. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 2011; 43:269-76. [PMID: 21317889 PMCID: PMC5094049 DOI: 10.1038/ng.768] [Citation(s) in RCA: 242] [Impact Index Per Article: 18.6] [Received: 08/23/2010] [Accepted: 01/20/2011] [Indexed: 11/09/2022]
Abstract
Accurate and complete analysis of genome variation in large populations will be required to understand the role of genome variation in complex disease. We present an analytical framework for characterizing genome deletion polymorphism in populations using sequence data that are distributed across hundreds or thousands of genomes. Our approach uses population-level concepts to reinterpret the technical features of sequence data that often reflect structural variation. In the 1000 Genomes Project pilot, this approach identified deletion polymorphism across 168 genomes (sequenced at 4× average coverage) with sensitivity and specificity unmatched by other algorithms. We also describe a way to determine the allelic state or genotype of each deletion polymorphism in each genome; the 1000 Genomes Project used this approach to type 13,826 deletion polymorphisms (48-995,664 bp) at high accuracy in populations. These methods offer a way to relate genome structural polymorphism to complex disease in populations.
Affiliation(s)
- Robert E Handsaker
- Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA
|
371
|
Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011; 21:974-84. [PMID: 21324876 DOI: 10.1101/gr.114876.110] [Citation(s) in RCA: 1089] [Impact Index Per Article: 83.8] [Indexed: 11/24/2022]
Abstract
Copy number variation (CNV) in the genome is a complex phenomenon, and not completely understood. We have developed a method, CNVnator, for CNV discovery and genotyping from read-depth (RD) analysis of personal genome sequencing. Our method is based on combining the established mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) to broaden the range of discovered CNVs. We calibrated CNVnator using the extensive validation performed by the 1000 Genomes Project. Because of this, we could use CNVnator for CNV discovery and genotyping in a population and characterization of atypical CNVs, such as de novo and multi-allelic events. Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), high genotyping accuracy (93%-95%), and high resolution in breakpoint discovery (<200 bp in 90% of cases with high sequencing coverage). Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: It misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair. By genotyping CNVs in the CEPH, Yoruba, and Chinese-Japanese populations, we estimated that at least 11% of all CNV loci involve complex, multi-allelic events, a considerably higher estimate than reported earlier. Moreover, among these events, we observed cases with allele distribution strongly deviating from Hardy-Weinberg equilibrium, possibly implying selection on certain complex loci. Finally, by combining discovery and genotyping, we identified six potential de novo CNVs in two family trios.
Affiliation(s)
- Alexej Abyzov
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA.
|
372
|
Miller CA, Hampton O, Coarfa C, Milosavljevic A. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS One 2011; 6:e16327. [PMID: 21305028 PMCID: PMC3031566 DOI: 10.1371/journal.pone.0016327] [Citation(s) in RCA: 149] [Impact Index Per Article: 11.5] [Received: 09/28/2010] [Accepted: 12/10/2010] [Indexed: 11/18/2022] Open
Abstract
Copy number alterations are important contributors to many genetic diseases, including cancer. We present the readDepth package for R, which can detect these aberrations by measuring the depth of coverage obtained by massively parallel sequencing of the genome. In addition to achieving higher accuracy than existing packages, our tool runs much faster by utilizing multi-core architectures to parallelize the processing of these large data sets. In contrast to other published methods, readDepth does not require the sequencing of a reference sample, and uses a robust statistical model that accounts for overdispersed data. It includes a method for effectively increasing the resolution obtained from low-coverage experiments by utilizing breakpoint information from paired end sequencing to do positional refinement. We also demonstrate a method for inferring copy number using reads generated by whole-genome bisulfite sequencing, thus enabling integrative study of epigenomic and copy number alterations. Finally, we apply this tool to two genomes, showing that it performs well on genomes sequenced to both low and high coverage. The readDepth package runs on Linux and MacOSX, is released under the Apache 2.0 license, and is available at http://code.google.com/p/readdepth/.
Affiliation(s)
- Christopher A. Miller
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, United States of America
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Oliver Hampton
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, United States of America
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Cristian Coarfa
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Aleksandar Milosavljevic
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, United States of America
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
|
373
|
Hammond S, Swanberg JC, Kaplarevic M, Lee KH. Genomic sequencing and analysis of a Chinese hamster ovary cell line using Illumina sequencing technology. BMC Genomics 2011; 12:67. [PMID: 21269493 PMCID: PMC3038171 DOI: 10.1186/1471-2164-12-67] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 09/14/2010] [Accepted: 01/26/2011] [Indexed: 11/20/2022] Open
Abstract
Background: Chinese hamster ovary (CHO) cells are among the most widely used hosts for therapeutic protein production. Yet few genomic resources are available to aid in engineering high-producing cell lines. Results: High-throughput Illumina sequencing was used to generate a 1x genomic coverage of an engineered CHO cell line expressing secreted alkaline phosphatase (SEAP). Reference-guided alignment and assembly produced 3.57 million contigs and CHO-specific sequence information for ~18,000 mouse and ~19,000 rat orthologous genes. The majority of these genes are involved in metabolic processes, cellular signaling, and transport and represent attractive targets for cell line engineering. Conclusions: This demonstrates the applicability of next-generation sequencing technology and comparative genomic analysis in the development of CHO genomic resources.
Affiliation(s)
- Stephanie Hammond
- Department of Chemical Engineering, University of Delaware, Newark, DE 19711, USA
|
374
|
Xi R, Kim TM, Park PJ. Detecting structural variations in the human genome using next generation sequencing. Brief Funct Genomics 2011; 9:405-15. [PMID: 21216738 DOI: 10.1093/bfgp/elq025] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 02/06/2023] Open
Abstract
Structural variations are widespread in the human genome and can serve as genetic markers in clinical and evolutionary studies. With the advances in the next-generation sequencing technology, recent methods allow for identification of structural variations with unprecedented resolution and accuracy. They also provide opportunities to discover variants that could not be detected on conventional microarray-based platforms, such as dosage-invariant chromosomal translocations and inversions. In this review, we will describe some of the sequencing-based algorithms for detection of structural variations and discuss the key issues in future development.
Affiliation(s)
- Ruibin Xi
- Center for Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
|
375
|
Warden M, Pique-Regi R, Ortega A, Asgharzadeh S. Bioinformatics for copy number variation data. Methods Mol Biol 2011; 719:235-49. [PMID: 21370087 DOI: 10.1007/978-1-61779-027-0_11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/02/2023]
Abstract
Copy number variation is known to be an important component of structural variation in the human genome. Greater than 1 kb in size, these gains and losses of genetic material are known to confer risk to many human diseases, both Mendelian and complex. Therefore, the technologies used to detect copy number variation have been quickly improving in both throughput and cost. From comparative genomic hybridization to synthetic high-density oligonucleotide arrays to next-generation sequencing methods, algorithms used to estimate copy number are plentiful. Here we describe a practical introduction to the copy number variation technology and available analysis methods, and demonstrate the analysis flow on an example case.
Affiliation(s)
- Melissa Warden
- Department of Pediatrics and Pathology, Keck School of Medicine, Childrens Hospital Los Angeles, University of Southern California, Los Angeles, CA, USA
|
376
|
Wong K, Keane TM, Stalker J, Adams DJ. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 2010; 11:R128. [PMID: 21194472 PMCID: PMC3046488 DOI: 10.1186/gb-2010-11-12-r128] [Citation(s) in RCA: 82] [Impact Index Per Article: 5.9] [Received: 08/09/2010] [Revised: 11/09/2010] [Accepted: 12/31/2010] [Indexed: 11/10/2022] Open
Abstract
We present a pipeline, SVMerge, to detect structural variants by integrating calls from several existing structural variant callers, which are then validated and the breakpoints refined using local de novo assembly. SVMerge is modular and extensible, allowing new callers to be incorporated as they become available. We applied SVMerge to the analysis of a HapMap trio, demonstrating enhanced structural variant detection, breakpoint refinement, and a lower false discovery rate. SVMerge can be downloaded from http://svmerge.sourceforge.net.
Affiliation(s)
- Kim Wong
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.
|
377
|
Abstract
Esophageal atresia and tracheoesophageal fistula (EA/TEF) are major congenital malformations affecting 1:3500 live births. Current research efforts are focused on understanding the etiology of these defects. We describe well-known animal models, human syndromes, and associations involving EA/TEF, indicating its etiologically heterogeneous nature. Recent advances in genotyping technology and in knowledge of human genetic variation will improve clinical counseling on etiologic factors. This review provides a clinical summary of environmental and genetic factors involved in EA/TEF.
|
378
|
Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, Thines M, Jiang RHY, Zody MC, Kunjeti SG, Donofrio NM, Meyers BC, Nusbaum C, Kamoun S. Genome evolution following host jumps in the Irish potato famine pathogen lineage. Science 2010; 330:1540-3. [PMID: 21148391 DOI: 10.1126/science.1193070] [Citation(s) in RCA: 272] [Impact Index Per Article: 19.4] [Indexed: 12/16/2022]
Abstract
Many plant pathogens, including those in the lineage of the Irish potato famine organism Phytophthora infestans, evolve by host jumps followed by specialization. However, how host jumps affect genome evolution remains largely unknown. To determine the patterns of sequence variation in the P. infestans lineage, we resequenced six genomes of four sister species. This revealed uneven evolutionary rates across genomes with genes in repeat-rich regions showing higher rates of structural polymorphisms and positive selection. These loci are enriched in genes induced in planta, implicating host adaptation in genome evolution. Unexpectedly, genes involved in epigenetic processes formed another class of rapidly evolving residents of the gene-sparse regions. These results demonstrate that dynamic repeat-rich genome compartments underpin accelerated gene evolution following host jumps in this pathogen lineage.
Affiliation(s)
- Sylvain Raffaele
- The Sainsbury Laboratory, Norwich Research Park, Norwich NR4 7UH, UK
|
379
|
Jongbloed JDH, Pósafalvi A, Kerstjens-Frederikse WS, Sinke RJ, van Tintelen JP. New clinical molecular diagnostic methods for congenital and inherited heart disease. Expert Opin Med Diagn 2010; 5:9-24. [DOI: 10.1517/17530059.2011.540566] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 12/20/2022]
|
380
|
Quintero-Rivera F, Deignan JL, Peredo J, Grody WW, Crandall B, Sims M, Cederbaum SD. An exon 1 deletion in OTC identified using chromosomal microarray analysis in a mother and her two affected deceased newborns: implications for the prenatal diagnosis of ornithine transcarbamylase deficiency. Mol Genet Metab 2010; 101:413-6. [PMID: 20817516 DOI: 10.1016/j.ymgme.2010.08.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/15/2010] [Revised: 08/06/2010] [Accepted: 08/07/2010] [Indexed: 11/24/2022]
Abstract
We describe the outcome of two consecutive pregnancies with a clinical presentation of ornithine transcarbamylase (OTC) deficiency (OTCD) without a molecular diagnosis. A 119 kb deletion on Xp11.4 including the OTC gene was detected in the mother. The same deletion was identified in the blood spots from the deceased male newborns. In patients with a clinical and biochemical presentation of OTCD and negative OTC sequencing, whole genome or targeted chromosomal microarray analysis (CMA) with coverage of the OTC and neighboring genes should be performed as a reflex test.
Affiliation(s)
- Fabiola Quintero-Rivera
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA.
|
381
|
Ku CS, Naidoo N, Teo SM, Pawitan Y. Regions of homozygosity and their impact on complex diseases and traits. Hum Genet 2010; 129:1-15. [PMID: 21104274 DOI: 10.1007/s00439-010-0920-6] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.4] [Received: 08/08/2010] [Accepted: 11/04/2010] [Indexed: 12/23/2022]
Abstract
Regions of homozygosity (ROHs) are more abundant in the human genome than previously thought. These regions are without heterozygosity, i.e. all the genetic variations within the regions have two identical alleles. At present there are no standardized criteria for defining ROHs, so different studies use their own criteria in the analysis of homozygosity. Compared to the era of genotyping microsatellite markers, the advent of high-density single nucleotide polymorphism genotyping arrays has provided an unparalleled opportunity to comprehensively detect these regions in the whole genome in different populations. Several studies have identified ROHs that were associated with complex phenotypes such as schizophrenia, late-onset Alzheimer's disease and height. Collectively, these studies have conclusively shown the abundance of ROHs larger than 1 Mb in outbred populations. The homozygosity association approach holds great promise in identifying genetic susceptibility loci harboring recessive variants for complex diseases and traits.
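As a rough illustration of the definition in this abstract, the sketch below scans position-sorted genotypes for runs of consecutive homozygous markers. The function name `find_rohs`, the 1 Mb default length, and the minimum marker count are assumptions made for this example only; as the abstract notes, there are no standardized criteria, and published studies apply their own thresholds.

```python
def find_rohs(positions, genotypes, min_length_bp=1_000_000, min_markers=2):
    """Return (start, end) spans of consecutive homozygous markers.

    positions: marker coordinates in bp, sorted ascending.
    genotypes: (allele_a, allele_b) tuples, one per marker.
    The thresholds are illustrative assumptions, not a published standard.
    """
    rohs = []
    run = []  # positions of the current run of homozygous markers

    def flush():
        # Keep the run only if it spans enough bases and enough markers.
        if len(run) >= min_markers and run[-1] - run[0] >= min_length_bp:
            rohs.append((run[0], run[-1]))
        run.clear()

    for pos, (allele_a, allele_b) in zip(positions, genotypes):
        if allele_a == allele_b:
            run.append(pos)
        else:
            flush()  # a heterozygous marker ends the current run
    flush()  # close a run that reaches the last marker
    return rohs
```

Raising `min_length_bp` or `min_markers` makes the call more conservative, which is one way the differing study criteria mentioned above would show up in practice.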
Affiliation(s)
- Chee Seng Ku
- Department of Epidemiology and Public Health, Centre for Molecular Epidemiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
|
382
|
Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, Barillot E. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics 2010; 27:268-9. [PMID: 21081509 PMCID: PMC3018818 DOI: 10.1093/bioinformatics/btq635] [Citation(s) in RCA: 183] [Impact Index Per Article: 13.1] [Indexed: 12/18/2022]
Abstract
Summary: We present a tool for control-free copy number alteration (CNA) detection using deep-sequencing data, particularly useful for cancer studies. The tool deals with two frequent problems in the analysis of cancer deep-sequencing data: absence of control sample and possible polyploidy of cancer cells. FREEC (control-FREE Copy number caller) automatically normalizes and segments copy number profiles (CNPs) and calls CNAs. If ploidy is known, FREEC assigns absolute copy number to each predicted CNA. To normalize raw CNPs, the user can provide a control dataset if available; otherwise GC content is used. We demonstrate that for Illumina single-end, mate-pair or paired-end sequencing, GC-content normalization provides smooth profiles that can be further segmented and analyzed in order to predict CNAs. Availability: Source code and sample data are available at http://bioinfo-out.curie.fr/projects/freec/. Contact: freec@curie.fr Supplementary information: Supplementary data are available at Bioinformatics online.
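The abstract does not spell out how the GC-content normalization is computed, so the following is only a minimal sketch of one common, simpler approach and not FREEC's actual procedure: each window's read count is divided by the median count of windows with similar GC content. The function name `gc_normalize` and the 5% bin width are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(counts, gc_fractions, bin_pct=5):
    """Divide each window's read count by the median count of its GC bin.

    counts: read counts per genomic window.
    gc_fractions: GC fraction (0.0 to 1.0) of each window.
    bin_pct: GC bin width in percentage points (illustrative assumption).
    """
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have the same length")

    # Assign each window to an integer GC bin, e.g. 40-44% -> bin 8.
    bin_ids = [int(round(gc * 100)) // bin_pct for gc in gc_fractions]

    by_bin = defaultdict(list)
    for count, bin_id in zip(counts, bin_ids):
        by_bin[bin_id].append(count)
    bin_medians = {bin_id: median(values) for bin_id, values in by_bin.items()}

    ratios = []
    for count, bin_id in zip(counts, bin_ids):
        m = bin_medians[bin_id]
        # A zero median leaves the ratio undefined for that bin.
        ratios.append(count / m if m > 0 else float("nan"))
    return ratios
```

Ratios near 1 suggest the typical copy number for that GC stratum; a downstream segmentation step, not shown here, would be needed before calling gains or losses.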
|
383
|
Waszak SM, Hasin Y, Zichner T, Olender T, Keydar I, Khen M, Stütz AM, Schlattl A, Lancet D, Korbel JO. Systematic inference of copy-number genotypes from personal genome sequencing data reveals extensive olfactory receptor gene content diversity. PLoS Comput Biol 2010; 6:e1000988. [PMID: 21085617 PMCID: PMC2978733 DOI: 10.1371/journal.pcbi.1000988] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.7] [Received: 05/10/2010] [Accepted: 10/05/2010] [Indexed: 12/02/2022] Open
Abstract
Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95-99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ~15% and ~20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies. CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing.
Human individual genome sequencing has recently become affordable, enabling highly detailed genetic sequence comparisons. While the identification and genotyping of single-nucleotide polymorphisms has already been successfully established for different sequencing platforms, the detection, quantification and genotyping of large-scale copy-number variants (CNVs), i.e., losses or gains of long genomic segments, has remained challenging. We present a computational approach that enables detecting CNVs in sequencing data and accurately identifies the actual copy-number at which DNA segments of interest occur in an individual genome. This approach enabled us to obtain novel insights into the largest human gene family, the olfactory receptors (ORs), involved in smell perception. While previous studies reported an abundance of CNVs in ORs, our approach enabled us to globally identify absolute differences in OR gene counts that exist between humans. While several OR genes have very high gene counts, other ORs are found only once or are missing entirely in some individuals. The latter have a particularly high probability of influencing individual differences in the perception of smell, a question that future experimental efforts can now address. Furthermore, we observed differences in OR gene counts between populations, pointing at ORs that might contribute to population-specific differences in smell.
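To make the idea of an integer copy-number genotype concrete, here is a deliberately simplified sketch that rounds a depth ratio to the nearest whole copy number. It is not CopySeq's statistical framework, which the abstract does not detail; the function name `copy_number_from_depth` and the assumption that a known diploid reference depth is available are hypothetical choices for this example.

```python
def copy_number_from_depth(region_depth, diploid_depth):
    """Estimate an integer copy number from depth of coverage.

    region_depth: mean read depth observed over the locus.
    diploid_depth: mean read depth expected for two copies (assumed known).
    A naive rounding rule, used here only to illustrate the concept.
    """
    if diploid_depth <= 0:
        raise ValueError("diploid_depth must be positive")
    if region_depth < 0:
        raise ValueError("region_depth must be non-negative")
    estimate = 2.0 * region_depth / diploid_depth
    # Round half up; the estimate is never negative at this point.
    return int(estimate + 0.5)
```

Under this rule a locus at half the diploid depth is called as one copy (a heterozygous deletion) and a locus with no reads as zero copies (a homozygous deletion). At low coverage the depth ratio is noisy, which is presumably why a proper statistical model is needed in practice.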
Affiliation(s)
- Sebastian M. Waszak
  - Department of Molecular Genetics, Crown Human Genome Center, Weizmann Institute of Science, Rehovot, Israel
  - Department of Biotechnology and Bioinformatics, Weihenstephan-Triesdorf University of Applied Sciences, Freising, Germany
  - Genome Biology Research Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Yehudit Hasin
  - Department of Molecular Genetics, Crown Human Genome Center, Weizmann Institute of Science, Rehovot, Israel
- Thomas Zichner
  - Genome Biology Research Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Tsviya Olender
  - Department of Molecular Genetics, Crown Human Genome Center, Weizmann Institute of Science, Rehovot, Israel
- Ifat Keydar
  - Department of Molecular Genetics, Crown Human Genome Center, Weizmann Institute of Science, Rehovot, Israel
- Miriam Khen
  - Department of Molecular Genetics, Crown Human Genome Center, Weizmann Institute of Science, Rehovot, Israel
- Adrian M. Stütz
  - Genome Biology Research Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Andreas Schlattl
  - Genome Biology Research Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Doron Lancet
  - Department of Molecular Genetics, Crown Human Genome Center, Weizmann Institute of Science, Rehovot, Israel
- Jan O. Korbel
  - Genome Biology Research Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
  - European Bioinformatics Institute, EMBL-EBI, Hinxton, United Kingdom
|
384
|
Hong D, Park SS, Ju YS, Kim S, Shin JY, Kim S, Yu SB, Lee WC, Lee S, Park H, Kim JI, Seo JS. TIARA: a database for accurate analysis of multiple personal genomes based on cross-technology. Nucleic Acids Res 2010; 39:D883-8. [PMID: 21051338 PMCID: PMC3013693 DOI: 10.1093/nar/gkq1101] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022] Open
Abstract
High-throughput genomic technologies have been used to explore personal human genomes for the past few years. Although the integration of technologies is important for high-accuracy detection of personal genomic variations, no databases have been prepared to systematically archive genomes and to facilitate the comparison of personal genomic data sets prepared using a variety of experimental platforms. We describe here the Total Integrated Archive of Short-Read and Array (TIARA; http://tiara.gmi.ac.kr) database, which contains personal genomic information obtained from next generation sequencing (NGS) techniques and ultra-high-resolution comparative genomic hybridization (CGH) arrays. This database improves the accuracy of detecting personal genomic variations, such as SNPs, short indels and structural variants (SVs). At present, 36 individual genomes have been archived and may be displayed in the database. TIARA supports a user-friendly genome browser, which retrieves read-depths (RDs) and log2 ratios from NGS and CGH arrays, respectively. In addition, this database provides information on all genomic variants and the raw data, including short reads and feature-level CGH data, through anonymous file transfer protocol. More personal genomes will be archived as more individuals are analyzed by NGS or CGH array. TIARA provides a new approach to the accurate interpretation of personal genomes for genome research.
Affiliation(s)
- Dongwan Hong
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul 110-799, Korea
|
385
|
Ivakhno S, Royce T, Cox AJ, Evers DJ, Cheetham RK, Tavaré S. CNAseg—a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Bioinformatics 2010; 26:3051-8. [DOI: 10.1093/bioinformatics/btq587] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.0] [Indexed: 01/11/2023] Open
|
386
|
Ding L, Wendl MC, Koboldt DC, Mardis ER. Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet 2010; 19:R188-96. [PMID: 20843826 DOI: 10.1093/hmg/ddq391] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Indexed: 01/05/2023] Open
Abstract
The application of next-generation sequencing technology has produced a transformation in cancer genomics, generating large data sets that can be analyzed in different ways to answer a multitude of questions about the genomic alterations associated with the disease. Analytical approaches can discover focused mutations such as substitutions and small insertions/deletions, large structural alterations and copy number events. As our capacity to produce such data for multiple cancers of the same type improves, so does the demand to analyze multiple tumor genomes simultaneously. For example, pathway-based analyses that provide the full mutational impact on cellular protein networks and correlation analyses aimed at revealing causal relationships between genomic alterations and clinical presentations are both enabled. As the repertoire of data grows to include mRNA-seq, non-coding RNA-seq and methylation for multiple genomes, our challenge will be to intelligently integrate data types and genomes to produce a coherent picture of the genetic basis of cancer.
Affiliation(s)
- Li Ding
- Department of Genetics, The Genome Center at Washington University School of Medicine, 4444 Forest Park Blvd., St Louis, MO 63108, USA
|
387
|
Magi A, Benelli M, Gozzini A, Girolami F, Torricelli F, Brandi ML. Bioinformatics for next generation sequencing data. Genes (Basel) 2010; 1:294-307. [PMID: 24710047 PMCID: PMC3954090 DOI: 10.3390/genes1020294] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.1] [Received: 07/27/2010] [Revised: 08/30/2010] [Accepted: 09/14/2010] [Indexed: 12/31/2022] Open
Abstract
The emergence of next-generation sequencing (NGS) platforms imposes increasing demands on statistical methods and bioinformatic tools for the analysis and the management of the huge amounts of data generated by these technologies. Even at this early stage of their commercial availability, a large number of software tools already exist for analyzing NGS data. These tools fall into several general categories, including alignment of sequence reads to a reference, base-calling and/or polymorphism detection, de novo assembly from paired or unpaired reads, structural variant detection and genome browsing. This manuscript aims to guide readers in choosing among the available computational tools for each step of the data analysis workflow.
Affiliation(s)
- Alberto Magi
  - Diagnostic Genetic Unit, Careggi Hospital, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy.
- Matteo Benelli
  - Diagnostic Genetic Unit, Careggi Hospital, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy.
- Alessia Gozzini
  - Diagnostic Genetic Unit, Careggi Hospital, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy.
- Francesca Girolami
  - Diagnostic Genetic Unit, Careggi Hospital, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy.
- Francesca Torricelli
  - Diagnostic Genetic Unit, Careggi Hospital, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy.
- Maria Luisa Brandi
  - Department of Internal Medicine, University of Florence Medical School, Florence, Italy.
|
388
|
Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M. Detecting copy number variation with mated short reads. Genome Res 2010; 20:1613-22. [PMID: 20805290 DOI: 10.1101/gr.106344.110] [Citation(s) in RCA: 119] [Impact Index Per Article: 8.5] [Indexed: 01/09/2023]
Abstract
The development of high-throughput sequencing (HTS) technologies has opened the door to novel methods for detecting copy number variants (CNVs) in the human genome. While in the past CNVs have been detected based on array CGH data, recent studies have shown that depth-of-coverage information from HTS technologies can also be used for the reliable identification of large copy-variable regions. Such methods, however, are hindered by sequencing biases that lead certain regions of the genome to be over- or undersampled, lowering their resolution and ability to accurately identify the exact breakpoints of the variants. In this work, we develop a method for CNV detection that supplements the depth-of-coverage with paired-end mapping information, where mate pairs mapping discordantly to the reference serve to indicate the presence of variation. Our algorithm, called CNVer, combines this information within a unified computational framework called the donor graph, allowing us to better mitigate the sequencing biases that cause uneven local coverage and accurately predict CNVs. We use CNVer to detect 4879 CNVs in the recently described genome of a Yoruban individual. Most of the calls (77%) coincide with previously known variants within the Database of Genomic Variants, while 81% of deletion copy number variants previously known for this individual coincide with one of our loss calls. Furthermore, we demonstrate that CNVer can reconstruct the absolute copy counts of segments of the donor genome and evaluate the feasibility of using CNVer with low coverage datasets.
Affiliation(s)
- Paul Medvedev
- Department of Computer Science, University of Toronto, Toronto, Ontario M5R 3G4, Canada
|
389
|
Ju YS, Hong D, Kim S, Park SS, Kim S, Lee S, Park H, Kim JI, Seo JS. Reference-unbiased copy number variant analysis using CGH microarrays. Nucleic Acids Res 2010; 38:e190. [PMID: 20802225 PMCID: PMC2978381 DOI: 10.1093/nar/gkq730] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 12/31/2022] Open
Abstract
Comparative genomic hybridization (CGH) microarrays have been used to determine copy number variations (CNVs) and their effects on complex diseases. Detection of absolute CNVs independent of genomic variants of an arbitrary reference sample has been a critical issue in CGH array experiments. Whole genome analysis using massively parallel sequencing with multiple ultra-high resolution CGH arrays provides an opportunity to catalog highly accurate genomic variants of the reference DNA (NA10851). Using information on variants, we developed a new method, the CGH array reference-free algorithm (CARA), which can determine reference-unbiased absolute CNVs from any CGH array platform. The algorithm enables the removal and rescue of false positive and false negative CNVs, respectively, which appear due to the effects of genomic variants of the reference sample in raw CGH array experiments. We found that the CARA remarkably enhanced the accuracy of CGH array in determining absolute CNVs. Our method thus provides a new approach to interpret CGH array data for personalized medicine.
Affiliation(s)
- Young Seok Ju
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul 110-799, Korea
|
390
|
Kim TM, Luquette LJ, Xi R, Park PJ. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics 2010; 11:432. [PMID: 20718989 PMCID: PMC2939611 DOI: 10.1186/1471-2105-11-432] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Received: 12/31/2009] [Accepted: 08/18/2010] [Indexed: 02/05/2023] Open
Abstract
Background: Recent advances in sequencing technologies have enabled generation of large-scale genome sequencing data. These data can be used to characterize a variety of genomic features, including the DNA copy number profile of a cancer genome. A robust and reliable method for screening chromosomal alterations would allow a detailed characterization of the cancer genome with unprecedented accuracy. Results: We develop a method for identifying copy number alterations in a tumor genome compared to its matched control, based on applying the Smith-Waterman algorithm to single-end sequencing data. In a performance test with simulated data, our algorithm shows >90% sensitivity and >90% precision in detecting a single copy number change that contains approximately 500 reads for the normal sample. With 100-bp reads, this corresponds to a ~50 kb region for 1X genome coverage of the human genome. We further refine the algorithm to develop rSW-seq (recursive Smith-Waterman-seq) to identify alterations in complex configurations, which are commonly observed in the human cancer genome. To validate our approach, we compare our algorithm with an existing algorithm using simulated and publicly available datasets. We also compare the sequencing-based profiles to microarray-based results. Conclusion: We propose rSW-seq as an efficient method for detecting copy number changes in the tumor genome.
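Two points in this abstract lend themselves to a short worked sketch. The first function reproduces the stated arithmetic (about 500 reads of 100 bp at 1X coverage span about 50 kb). The second shows one way a Smith-Waterman-style local scan can be run over position-sorted reads labelled tumor ("T") or normal ("N"). The weights of +1 and -1, the labels and the function names are illustrative assumptions, not the scoring scheme published for rSW-seq.

```python
def approx_region_bp(n_reads, read_length_bp, coverage):
    """Region size implied by n_reads at a given read length and coverage."""
    return n_reads * read_length_bp / coverage


def best_gain_segment(labels, tumor_weight=1.0, normal_weight=-1.0):
    """Return (score, start, end) of the highest-scoring run of reads.

    The running score is reset to zero whenever it stops being positive,
    which is the one-dimensional form of the Smith-Waterman recursion.
    'end' is exclusive, so the segment is labels[start:end].
    """
    best = (0.0, 0, 0)
    score, start = 0.0, 0
    for i, label in enumerate(labels):
        if score <= 0.0:
            # Restart the candidate segment at the current read.
            score, start = 0.0, i
        score += tumor_weight if label == "T" else normal_weight
        if score > best[0]:
            best = (score, start, i + 1)
    return best


print(approx_region_bp(500, 100, 1.0))  # 50000.0 bp, i.e. about 50 kb

# Hypothetical read labels: balanced flanks around a tumor-enriched stretch.
reads = list("TNTN" + "TTTNTTT" + "NTNT")
print(best_gain_segment(reads))  # (5.0, 4, 11)
```

Swapping the sign of the two weights would make the same scan look for tumor-depleted segments, that is, candidate losses rather than gains.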
Affiliation(s)
- Tae-Min Kim
- Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, Massachusetts 02115, USA
|
391
|
Wood HM, Belvedere O, Conway C, Daly C, Chalkley R, Bickerdike M, McKinley C, Egan P, Ross L, Hayward B, Morgan J, Davidson L, MacLennan K, Ong TK, Papagiannopoulos K, Cook I, Adams DJ, Taylor GR, Rabbitts P. Using next-generation sequencing for high resolution multiplex analysis of copy number variation from nanogram quantities of DNA from formalin-fixed paraffin-embedded specimens. Nucleic Acids Res 2010; 38:e151. [PMID: 20525786 PMCID: PMC2919738 DOI: 10.1093/nar/gkq510] [Citation(s) in RCA: 95] [Impact Index Per Article: 6.8] [Indexed: 11/22/2022] Open
Abstract
The use of next-generation sequencing technologies to produce genomic copy number data has recently been described. Most approaches, however, rely on optimal starting DNA, and are therefore unsuitable for the analysis of formalin-fixed paraffin-embedded (FFPE) samples, which largely precludes the analysis of many tumour series. We have sought to challenge the limits of this technique with regard to quality and quantity of starting material and the depth of sequencing required. We confirm that the technique can be used to interrogate DNA from cell lines, fresh frozen material and FFPE samples to assess copy number variation. We show that as little as 5 ng of DNA is needed to generate a copy number karyogram, and follow this up with data from a series of FFPE biopsies and surgical samples. We have used various levels of sample multiplexing to demonstrate the adjustable resolution of the methodology, depending on the number of samples and available resources. We also demonstrate reproducibility by use of replicate samples and comparison with microarray-based comparative genomic hybridization (aCGH) and digital PCR. This technique can be valuable in both the analysis of routine diagnostic samples and in examining large repositories of fixed archival material.
Affiliation(s)
- Henry M Wood
- Leeds Institute of Molecular Medicine, St James's University Hospital, Leeds, UK.
|
392
|
Koboldt DC, Ding L, Mardis ER, Wilson RK. Challenges of sequencing human genomes. Brief Bioinform 2010; 11:484-98. [PMID: 20519329 DOI: 10.1093/bib/bbq016] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Indexed: 12/16/2022] Open
Abstract
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35-250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
Affiliation(s)
- Daniel C Koboldt
- The Genome Center at Washington University, St. Louis, Missouri 63108, USA.
|
393
|
Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 2010; 11:415-25. [PMID: 20479773 DOI: 10.1038/nrg2779] [Citation(s) in RCA: 827] [Impact Index Per Article: 59.1] [Indexed: 12/12/2022]
Abstract
Although genome-wide association (GWA) studies for common variants have thus far succeeded in explaining only a modest fraction of the genetic components of human common diseases, recent advances in next-generation sequencing technologies could rapidly facilitate substantial progress. This outcome is expected if much of the missing genetic control is due to gene variants that are too rare to be picked up by GWA studies and have relatively large effects on risk. Here, we evaluate the evidence for an important role of rare gene variants of major effect in common diseases and outline discovery strategies for their identification.
Affiliation(s)
- Elizabeth T Cirulli
- Center for Human Genome Variation, Duke University Medical School, Durham, North Carolina 27708, USA
|
394
|
Machado HE, Renn SCP. A critical assessment of cross-species detection of gene duplicates using comparative genomic hybridization. BMC Genomics 2010; 11:304. [PMID: 20465839 PMCID: PMC2876127 DOI: 10.1186/1471-2164-11-304] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/04/2010] [Accepted: 05/13/2010] [Indexed: 11/15/2022] Open
Abstract
Background: Comparison of genomic DNA among closely related strains or species is a powerful approach for identifying variation in evolutionary processes. One potent source of genomic variation is gene duplication, which is prevalent among individuals and species. Array comparative genomic hybridization (aCGH) has been successfully utilized to detect this variation among lineages. Here, beyond the demonstration that gene duplicates among species can be quantified with aCGH, we consider the effect of sequence divergence on the ability to detect gene duplicates. Results: Using the X chromosome genomic content difference between male D. melanogaster and female D. yakuba and D. simulans, we describe a decrease in the ability to accurately measure genomic content (copy number) for orthologs that are only 90% identical. We demonstrate that genome characteristics (e.g. chromatin environment and non-orthologous sequence similarity) can also affect the ability to accurately measure genomic content. We describe a normalization strategy and statistical criteria to be used for the identification of gene duplicates among any species group for which an array platform is available from a closely related species. Conclusions: Array CGH can be used to effectively identify gene duplication and genome content; however, certain biases are present due to sequence divergence and other genome characteristics resulting from the divergence between lineages. Highly conserved gene duplicates will be more readily recovered by aCGH. Duplicates that have been retained for a selective advantage due to directional selection acting on many loci in one or both gene copies are likely to be under-represented. The results of this study should inform the interpretation of both previously published and future work that employs this powerful technique.
|
395
|
Fadista J, Thomsen B, Holm LE, Bendixen C. Copy number variation in the bovine genome. BMC Genomics 2010; 11:284. [PMID: 20459598 PMCID: PMC2902221 DOI: 10.1186/1471-2164-11-284] [Citation(s) in RCA: 126] [Impact Index Per Article: 9.0] [Received: 12/11/2009] [Accepted: 05/06/2010] [Indexed: 12/12/2022] Open
Abstract
Background: Copy number variations (CNVs), which represent a significant source of genetic diversity in mammals, have been shown to be associated with phenotypes of clinical relevance and to be causative of disease. Notwithstanding, little is known about the extent to which CNV contributes to genetic variation in cattle. Results: We designed and used a set of NimbleGen CGH arrays that tile across the assayable portion of the cattle genome with approximately 6.3 million probes, at a median probe spacing of 301 bp. This study reports the highest resolution map of copy number variation in the cattle genome, with 304 CNV regions (CNVRs) being identified among the genomes of 20 bovine samples from 4 dairy and beef breeds. The CNVRs identified covered 0.68% (22 Mb) of the genome, and ranged in size from 1.7 to 2,031 kb (median size 16.7 kb). About 20% of the CNVs co-localized with segmental duplications, while 30% encompass genes, the majority of which are involved in environmental response. About 10% of the human orthologs of these genes are associated with human disease susceptibility and, hence, may have important phenotypic consequences. Conclusions: Together, this analysis provides a useful resource for assessing the impact of CNVs on variation in bovine health and production traits.
Affiliation(s)
- João Fadista
- Group of Molecular Genetics and Systems Biology, Department of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University, Blichers Allé 20, DK-8830 Tjele, Denmark
|
396
|
Duan J, Sanders AR, Gejman PV. Genome-wide approaches to schizophrenia. Brain Res Bull 2010; 83:93-102. [PMID: 20433910 DOI: 10.1016/j.brainresbull.2010.04.009] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Received: 10/20/2009] [Revised: 04/20/2010] [Accepted: 04/21/2010] [Indexed: 12/25/2022]
Abstract
Schizophrenia (SZ) is a common and severe psychiatric disorder with both environmental and genetic risk factors, and a high heritability. After over 20 years of molecular genetics research, new molecular strategies, primarily genome-wide association studies (GWAS), have generated major tangible progress. These new data provide evidence for: (1) a number of chromosomal regions with common polymorphisms showing genome-wide association with SZ (the major histocompatibility complex, MHC, region at 6p22-p21; 18q21.2; and 2q32.1). The associated alleles present small odds ratios (the odds of a risk variant being present in cases vs. controls) and suggest causative involvement of gene regulatory mechanisms in SZ. (2) Polygenic inheritance. (3) Involvement of rare (<1%) and large (>100 kb) copy number variants (CNVs). (4) A genetic overlap of SZ with autism and with bipolar disorder (BP), challenging the classical clinical classifications. Most new SZ findings (chromosomal regions and genes) have generated new biological leads. These new findings, however, still need to be translated into a better understanding of the underlying biology and into causal mechanisms. Furthermore, a considerable amount of heritability still remains unexplained (missing heritability). Deep resequencing for rare variants and systems biology approaches (e.g., integrating DNA sequence and functional data) are expected to further improve our understanding of the genetic architecture of SZ and its underlying biology.
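The odds ratio mentioned in the abstract can be made concrete with a small worked example. The allele counts below are invented purely for illustration and do not come from any schizophrenia study; the function name is likewise an assumption.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the risk variant in cases divided by the odds in controls."""
    case_odds = case_carriers / case_noncarriers
    control_odds = control_carriers / control_noncarriers
    return case_odds / control_odds


# Hypothetical counts: 1100 of 2000 case alleles and 1000 of 2000 control
# alleles carry the risk variant.
print(round(odds_ratio(1100, 900, 1000, 1000), 2))  # 1.22
```

An odds ratio this close to 1 is what "small" means here: the variant is only slightly more common among cases than among controls, so large samples are needed to detect it.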
Affiliation(s)
- Jubao Duan
- Center for Psychiatric Genetics, Department of Psychiatry and Behavioral Sciences, Northshore University HealthSystem Research Institute, 1001 University Place, Evanston, IL 60201, USA.
|
397
|
Need AC, Goldstein DB. Whole genome association studies in complex diseases: where do we stand? Dialogues in Clinical Neuroscience 2010. [PMID: 20373665 PMCID: PMC3181943 DOI: 10.31887/dcns.2010.12.1/aneed] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023]
Abstract
Hundreds of genome-wide association studies have been performed in recent years in order to try to identify common variants that associate with complex disease. These have met with varying success. Some of the strongest effects of common variants have been found in late-onset diseases and in drug response. The major histocompatibility complex has also shown very strong association with a variety of disorders. Although there have been some notable success stories in neuropsychiatric genetics, on the whole, common variation has explained little of the high heritability of these traits. In contrast, early studies of rare copy number variants have led rapidly to a number of genes and loci that strongly associate with neuropsychiatric disorders. It is likely that the use of whole-genome sequencing to extend the study of rare variation in neuropsychiatry will greatly advance our understanding of neuropsychiatric genetics.
Affiliation(s)
- Anna C Need
- Institute for Genome Sciences and Policy, Center for Human Genome Variation, Duke University, Durham, North Carolina 27708, USA
|
398
|
DNA copy number, including telomeres and mitochondria, assayed using next-generation sequencing. BMC Genomics 2010; 11:244. [PMID: 20398377 PMCID: PMC2867831 DOI: 10.1186/1471-2164-11-244] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 10/15/2009] [Accepted: 04/16/2010] [Indexed: 11/14/2022] Open
Abstract
Background: DNA copy number variations occur within populations and aberrations can cause disease. We sought to develop an improved lab-automatable, cost-efficient, accurate platform to profile DNA copy number. Results: We developed a sequencing-based assay of nuclear, mitochondrial, and telomeric DNA copy number that draws on the unbiased nature of next-generation sequencing and incorporates techniques developed for RNA expression profiling. To demonstrate this platform, we assayed UMC-11 cells using 5 million 33 nt reads and found tremendous copy number variation, including regions of single and homogeneous deletions and amplifications to 29 copies; 5 times more mitochondria and 4 times less telomeric sequence than a pool of non-diseased, blood-derived DNA; and that UMC-11 was derived from a male individual. Conclusion: The described assay outputs absolute copy number, outputs an error estimate (p-value), and is more accurate than array-based platforms at high copy number. The platform enables profiling of mitochondrial levels and telomeric length. The assay is lab-automatable and has a genomic resolution and cost that are tunable based on the number of sequence reads.
|
399
|
Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell JC, Hall IM. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 2010; 20:623-35. [PMID: 20308636 DOI: 10.1101/gr.102970.109] [Citation(s) in RCA: 200] [Impact Index Per Article: 14.3] [Indexed: 12/24/2022]
Abstract
Structural variation (SV) is a rich source of genetic diversity in mammals, but due to the challenges associated with mapping SV in complex genomes, basic questions regarding their genomic distribution and mechanistic origins remain unanswered. We have developed an algorithm (HYDRA) to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences. We applied these methods to two inbred mouse strains: C57BL/6J and DBA/2J. We demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications; however, our analysis of the C57BL/6J reference strain shows that incomplete reference genome assemblies are a major source of noise. We report 7196 SVs between the two strains, more than two-thirds of which are due to transposon insertions. Of the remainder, 59% are deletions (relative to the reference), 26% are insertions of unlinked DNA, 9% are tandem duplications, and 6% are inversions. To investigate the origins of SV, we characterized 3316 breakpoint sequences at single-nucleotide resolution. We find that approximately 16% of non-transposon SVs have complex breakpoint patterns consistent with template switching during DNA replication or repair, and that this process appears to preferentially generate certain classes of complex variants. Moreover, we find that SVs are significantly enriched in regions of segmental duplication, but that this effect is largely independent of DNA sequence homology and thus cannot be explained by non-allelic homologous recombination (NAHR) alone. This result suggests that the genetic instability of such regions is often the cause rather than the consequence of duplicated genomic architecture.
Affiliation(s)
- Aaron R Quinlan
- Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, Virginia 22908, USA
|
400
|
Snyder M, Du J, Gerstein M. Personal genome sequencing: current approaches and challenges. Genes Dev 2010; 24:423-31. [PMID: 20194435 DOI: 10.1101/gad.1864110] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Indexed: 11/25/2022]
Abstract
The revolution in DNA sequencing technologies has now made it feasible to determine the genome sequences of many individuals; i.e., "personal genomes." Genome sequences of cells and tissues from both normal and disease states have been determined. Using current approaches, whole human genome sequences are not typically assembled and determined de novo, but, instead, variations relative to a reference sequence are identified. We discuss the current state of personal genome sequencing, the main steps involved in determining a genome sequence (i.e., identifying single-nucleotide polymorphisms [SNPs] and structural variations [SVs], assembling new sequences, and phasing haplotypes), and the challenges and performance metrics for evaluating the accuracy of the reconstruction. Finally, we consider the possible individual and societal benefits of personal genome sequences.
Affiliation(s)
- Michael Snyder
- Department of Genetics, Stanford University School of Medicine, California 94305, USA.
|