601
|
Wang LY, Abyzov A, Korbel JO, Snyder M, Gerstein M. MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome Res 2008; 19:106-17. [PMID: 19037015 DOI: 10.1101/gr.080069.108] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Genome structural variation includes segmental duplications, deletions, and other rearrangements, and array-based comparative genomic hybridization (array-CGH) is a popular technology for determining this. Drawing relevant conclusions from array-CGH requires computational methods for partitioning the chromosome into segments of elevated, reduced, or unchanged copy number. Several approaches have been described, most of which attempt to explicitly model the underlying distribution of data based on particular assumptions. Often, they optimize likelihood functions for estimating model parameters, by expectation maximization or related approaches; however, this requires good parameter initialization through prespecifying the number of segments. Moreover, convergence is difficult to achieve, since many parameters are required to characterize an experiment. To overcome these limitations, we propose a nonparametric method without a global criterion to be optimized. Our method involves mean-shift-based (MSB) procedures; it considers the observed array-CGH signal as sampling from a probability-density function, uses a kernel-based approach to estimate local gradients for this function, and iteratively follows them to determine local modes of the signal. Overall, our method achieves robust discontinuity-preserving smoothing, thus accurately segmenting chromosomes into regions of duplication and deletion. It does not require the number of segments as input, nor does its convergence depend on this. We successfully applied our method to both simulated data and array-CGH experiments on glioblastoma and adenocarcinoma. We show that it performs at least as well as, and often better than, 10 previously published algorithms. Finally, we show that our approach can be extended to segmenting the signal resulting from the depth-of-coverage of mapped reads from next-generation sequencing.
Collapse
Affiliation(s)
- Lu-Yong Wang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
| | | | | | | | | |
Collapse
|
603
|
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IMJ, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DMD, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara E Catenazzi M, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, et alBentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IMJ, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DMD, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara E Catenazzi M, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008; 456:53-9. [PMID: 18987734 PMCID: PMC2581791 DOI: 10.1038/nature07517] [Show More Authors] [Citation(s) in RCA: 2444] [Impact Index Per Article: 143.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2008] [Accepted: 10/02/2008] [Indexed: 11/15/2022]
Abstract
DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
Collapse
Affiliation(s)
- David R Bentley
- Illumina Cambridge Ltd. (Formerly Solexa Ltd), Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
605
|
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genes Dev 2008; 18:1851-8. [PMID: 18714091 PMCID: PMC2577856 DOI: 10.1101/gr.078212.108] [Citation(s) in RCA: 1843] [Impact Index Per Article: 108.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2008] [Accepted: 08/13/2008] [Indexed: 11/24/2022]
Abstract
New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
Collapse
Affiliation(s)
- Heng Li
- The Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom
| | - Jue Ruan
- Beijing Genomics Institute, Chinese Academy of Science, Beijing 100029, China
| | - Richard Durbin
- The Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom
| |
Collapse
|
606
|
Liu Z, Venkatesh SS, Maley CC. Sequence space coverage, entropy of genomes and the potential to detect non-human DNA in human samples. BMC Genomics 2008; 9:509. [PMID: 18973670 PMCID: PMC2628393 DOI: 10.1186/1471-2164-9-509] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2008] [Accepted: 10/30/2008] [Indexed: 12/17/2022] Open
Abstract
Background Genomes store information for building and maintaining organisms. Complete sequencing of many genomes provides the opportunity to study and compare global information properties of those genomes. Results We have analyzed aspects of the information content of Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli (K-12) genomes. Virtually all possible (> 98%) 12 bp oligomers appear in vertebrate genomes while < 2% of 19 bp oligomers are present. Other species showed different ranges of > 98% to < 2% of possible oligomers in D. melanogaster (12–17 bp), C. elegans (11–17 bp), A. thaliana (11–17 bp), S. cerevisiae (10–16 bp) and E. coli (9–15 bp). Frequencies of unique oligomers in the genomes follow similar patterns. We identified a set of 2.6 M 15-mers that are more than 1 nucleotide different from all 15-mers in the human genome and so could be used as probes to detect microbes in human samples. In a human sample, these probes would detect 100% of the 433 currently fully sequenced prokaryotes and 75% of the 3065 fully sequenced viruses. The human genome is significantly more compact in sequence space than a random genome. We identified the most frequent 5- to 20-mers in the human genome, which may prove useful as PCR primers. We also identified a bacterium, Anaeromyxobacter dehalogenans, which has an exceptionally low diversity of oligomers given the size of its genome and its GC content. The entropy of coding regions in the human genome is significantly higher than non-coding regions and chromosomes. However chromosomes 1, 2, 9, 12 and 14 have a relatively high proportion of coding DNA without high entropy, and chromosome 20 is the opposite with a low frequency of coding regions but relatively high entropy. Conclusion Measures of the frequency of oligomers are useful for designing PCR assays and for identifying chromosomes and organisms with hidden structure that had not been previously recognized. This information may be used to detect novel microbes in human tissues.
Collapse
Affiliation(s)
- Zhandong Liu
- Genomics and Computational Biology Graduate Group, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | | | | |
Collapse
|
607
|
Variable breakpoints target PAX5 in patients with dicentric chromosomes: a model for the basis of unbalanced translocations in cancer. Proc Natl Acad Sci U S A 2008; 105:17050-4. [PMID: 18957548 DOI: 10.1073/pnas.0803494105] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The search for target genes involved in unbalanced acquired chromosomal abnormalities has been largely unsuccessful, because the breakpoints of these rearrangements are too variable. Here, we use the example of dicentric chromosomes in B cell precursor acute lymphoblastic leukemia to show that, despite this heterogeneity, single genes are targeted through a variety of mechanisms. FISH showed that, although they were heterogeneous, breakpoints on 9p resulted in the partial or complete deletion of PAX5. Molecular copy number counting further delineated the breakpoints and facilitated cloning with long-distance inverse PCR. This approach identified 5 fusion gene partners with PAX5: LOC392027 (7p12.1), SLCO1B3 (12p12), ASXL1 (20q11.1), KIF3B (20q11.21), and C20orf112 (20q11.1). In each predicted fusion protein, the DNA-binding paired domain of PAX5 was present. Using quantitative PCR, we demonstrated that both the deletion and gene fusion events resulted in the same underexpression of PAX5, which extended to the differential expression of the PAX5 target genes, EBF1, ALDH1A1, ATP9A, and FLT3. Further molecular analysis showed deletion and mutation of the homologous PAX5 allele, providing further support for the key role of PAX5. Here, we show that specific gene loci may be the target of heterogeneous translocation breakpoints in human cancer, acting through a variety of mechanisms. This approach indicates an application for the identification of cancer genes in solid tumours, where unbalanced chromosomal rearrangements are particularly prevalent and few genes have been identified. It can be extrapolated that this strategy will reveal that the same mechanisms operate in cancer pathogenesis in general.
Collapse
|
608
|
ten Bosch JR, Grody WW. Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn 2008; 10:484-92. [PMID: 18832462 DOI: 10.2353/jmoldx.2008.080027] [Citation(s) in RCA: 124] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
The speed, accuracy, efficiency, and cost-effectiveness of DNA sequencing have been improving continuously since the initial derivation of the technique in the mid-1970s. With the advent of massively parallel sequencing technologies, DNA sequencing costs have been dramatically reduced. No longer is it unthinkable to sequence hundreds or even thousands of genes in a single individual with a suspected genetic disease or complex disease predisposition. Along with the benefits offered by these technologies come a number of challenges that must be addressed before wide-scale sequencing becomes accepted medical practice. Molecular diagnosticians will need to become comfortable with, and gain confidence in, these new platforms, which are based on radically different technologies compared to the standard DNA sequencers in routine use today. Experience will determine whether these instruments are best applied to sequencing versus resequencing. Perhaps most importantly, along with increasing read lengths inevitably comes increased ascertainment of novel sequence variants of uncertain clinical significance, the postanalytical aspects of which could bog down the entire field. But despite these obstacles, and as a direct result of the promises these sequencing advances present, it will likely not be long before next-generation sequencing begins to make an impact in molecular medicine. In this review, technical issues are discussed, in addition to the practical considerations that will need to be addressed as advances push toward personal genome sequencing.
Collapse
Affiliation(s)
- John R ten Bosch
- Departments of Human Genetics, University of California at Los Angeles School of Medicine, Los Angeles, California, USA
| | | |
Collapse
|