1
Fan X, Chaisson M, Nakhleh L, Chen K. HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies. Genome Res 2017; 27:793-800. [PMID: 28104618 PMCID: PMC5411774 DOI: 10.1101/gr.214767.116] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 08/15/2016] [Accepted: 12/19/2016] [Indexed: 12/29/2022]
Abstract
Achieving complete, accurate, and cost-effective assembly of human genomes is of great importance for realizing the promise of precision medicine. The abundance of repeats and genetic variations in human genomes and the limitations of existing sequencing technologies call for the development of novel assembly methods that can leverage the complementary strengths of multiple technologies. We propose a Hybrid Structural variant Assembly (HySA) approach that integrates sequencing reads from next-generation sequencing and single-molecule sequencing technologies to accurately assemble and detect structural variants (SVs) in human genomes. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, our approach turns a whole genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance the assembly of structurally altered regions in human genomes. We used data generated from a haploid hydatidiform mole genome (CHM1) and a diploid human genome (NA12878) to test our approach. The result showed that, compared with existing methods, our approach had a low false discovery rate and substantially improved the detection of many types of SVs, particularly novel large insertions, small indels (10–50 bp), and short tandem repeat expansions and contractions. Our work highlights the strengths and limitations of current approaches and provides an effective solution for extending the power of existing sequencing technologies for SV discovery.
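The clustering step described in this abstract can be illustrated with a minimal sketch. It assumes that "clusters" can be approximated by connected components of a bipartite graph whose two node sets are short reads and long reads, with an edge wherever a short read aligns to a long read. The abstract does not specify the actual HySA algorithm, so the function name `cluster_bipartite`, the edge format, and the use of union-find are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict


def cluster_bipartite(edges):
    """Group reads into connected components of a bipartite graph.

    edges: iterable of (short_read_id, long_read_id) pairs, one per
    hypothetical alignment between a short read and a long read.
    Returns a list of dicts with the short and long read IDs per cluster.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag each ID with its side so equal IDs on both sides cannot collide.
    for short_id, long_id in edges:
        union(("short", short_id), ("long", long_id))

    clusters = defaultdict(lambda: {"short": set(), "long": set()})
    for node in list(parent):
        kind, read_id = node
        clusters[find(node)][kind].add(read_id)

    # Sort for a deterministic order; every cluster has at least one long read.
    return sorted(clusters.values(), key=lambda c: sorted(c["long"]))


# Hypothetical alignments: s1 and s2 share long read L1, s2 also hits L2.
example_edges = [("s1", "L1"), ("s2", "L1"), ("s2", "L2"), ("s3", "L3")]
example_clusters = cluster_bipartite(example_edges)
```

Each resulting cluster could then be handed to a local assembler independently, which is the sense in which the whole-genome problem is split into separate SV assembly problems.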
Affiliation(s)
- Xian Fan
- Department of Computer Science, Rice University, Houston, Texas 77005, USA; Department of Bioinformatics and Computational Biology, Division of Quantitative Sciences, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
- Mark Chaisson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Luay Nakhleh
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Ken Chen
- Department of Bioinformatics and Computational Biology, Division of Quantitative Sciences, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
2
Beal MA, Glenn TC, Somers CM. Whole genome sequencing for quantifying germline mutation frequency in humans and model species: cautious optimism. Mutat Res 2012; 750:96-106. [PMID: 22178956 DOI: 10.1016/j.mrrev.2011.11.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 10/31/2011] [Revised: 11/29/2011] [Accepted: 11/30/2011] [Indexed: 05/31/2023]
Abstract
Factors affecting the type and frequency of germline mutations in animals are of significant interest from health and toxicology perspectives. However, studies in this field have been limited by the use of markers with low detection power or uncertain relevance to phenotype. Whole genome sequencing (WGS) is now a potential option to directly determine germline mutation type and frequency in family groups at all loci simultaneously. Medical studies have already capitalized on WGS to identify novel mutations in human families for clinical purposes, such as identifying candidate genes contributing to inherited conditions. However, WGS has not yet been used in any studies of vertebrates that aim to quantify changes in germline mutation frequency as a result of environmental factors. WGS is a promising tool for detecting mutation induction, but it is currently limited by several technical challenges. Perhaps the most pressing issue is sequencing error rates that are currently high in comparison to the intergenerational mutation frequency. Different platforms and depths of coverage currently result in a range of 10 to 10³ false positives for every true mutation. In addition, the cost of WGS is still relatively high, particularly when comparing mutation frequencies among treatment groups with even moderate sample sizes. Despite these challenges, WGS offers the potential for unprecedented insight into germline mutation processes. Refinement of available tools and emergence of new technologies may be able to provide the improved accuracy and reduced costs necessary to make WGS viable in germline mutation studies in the very near future. To streamline studies, researchers may use multiple family triads per treatment group and sequence a targeted (reduced) portion of each genome with high (20-40×) depth of coverage. We are optimistic about the application of WGS for quantifying germline mutations, but caution researchers regarding the resource-intensive nature of the work using existing technology.
Affiliation(s)
- Marc A Beal
- University of Regina, Department of Biology, 3737 Wascana Parkway, Regina, Saskatchewan, Canada S4S 0A2
- Travis C Glenn
- University of Georgia, Environmental Health Science, College of Public Health, Athens, GA 30602, USA
- Christopher M Somers
- University of Regina, Department of Biology, 3737 Wascana Parkway, Regina, Saskatchewan, Canada S4S 0A2
3
Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PLoS One 2012; 7:e30377. [PMID: 22276185 PMCID: PMC3261895 DOI: 10.1371/journal.pone.0030377] [Citation(s) in RCA: 327] [Impact Index Per Article: 27.3] [Received: 01/28/2011] [Accepted: 12/19/2011] [Indexed: 01/17/2023]
Abstract
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
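To make the notion of mappability concrete, here is a small sketch under simplifying assumptions: it scores each position as the inverse of the number of exact occurrences of the k-mer starting there, on a single strand and with zero mismatches. The GEM program described in the abstract supports mismatches and uses a fast mapping-based algorithm. This brute-force version is a stand-in for illustration only, and the function name `kmer_mappability` is hypothetical.

```python
from collections import Counter


def kmer_mappability(genome, k):
    """Return one score per k-mer start position in `genome`.

    Score = 1 / (number of exact occurrences of that k-mer in `genome`).
    A score of 1.0 means the k-mer is unique; lower values mean repeats.
    Assumptions: forward strand only, no mismatches allowed.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and len(genome)")
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Toy sequence in which the 3-mer "ACG" occurs twice.
scores = kmer_mappability("ACGTACGA", 3)
```

Positions with low scores are the ones where a read of length k cannot be placed unambiguously, which is why such a track is useful when interpreting SNP calls or expression estimates.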
Affiliation(s)
- Thomas Derrien
- Institut de Génétique et Développement (IGDR), Université Rennes 1, Rennes, France
- Jordi Estellé
- Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain
- David G. Knowles
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
- Roderic Guigó
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
- Paolo Ribeca
- Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain
4
Du J, Leng J, Habegger L, Sboner A, McDermott D, Gerstein M. IQSeq: integrated isoform quantification analysis based on next-generation sequencing. PLoS One 2012; 7:e29175. [PMID: 22238592 PMCID: PMC3253133 DOI: 10.1371/journal.pone.0029175] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/17/2011] [Accepted: 11/22/2011] [Indexed: 12/31/2022]
Abstract
With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: here one tries to reconstruct the abundances of isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present theoretical results which IQSeq is based on, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one would try to estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a speedup of times compared to brute-force resampling. IQSeq also calculates an information theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.
Affiliation(s)
- Jiang Du
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Jing Leng
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Lukas Habegger
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Andrea Sboner
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Drew McDermott
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Mark Gerstein
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
5
Abstract
Structural variation (SV) encompasses diverse types of genomic variants including deletions, duplications, inversions, transpositions, translocations, and complex rearrangements, and is now recognized to be an abundant class of genetic variation in mammals. Different individuals, or strains, of a given species can differ by thousands of variants. However, despite a large number of studies over the past decade and impressive progress on many fronts, there remain significant gaps in our knowledge, particularly in species other than human. Arguably the most relevant among these are genetically tractable models such as mouse, rat, and dog. The emergence of efficient and affordable DNA sequencing technologies presents an opportunity to make rapid progress toward understanding the nature, origin, and function of SV in these, and other, domesticated species. Here, we summarize the current state of knowledge of SV in mammals, with a focus on the similarities and differences between domesticated species and human. We then present methods to identify SV breakpoints from next-generation sequence (NGS) data by paired-end mapping, split-read mapping, and local assembly, and discuss challenges that arise when interpreting these data in lineages with complex breeding histories and incomplete reference genomes. We further describe technical modifications that allow for identification of variants involving repetitive DNA elements such as transposons and segmental duplications. Finally, we explore a few of the key biological insights that can be gained by applying NGS methods to model organisms.
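The paired-end mapping idea mentioned in this abstract can be sketched as a simple classifier for read pairs. This is a textbook-style simplification, not the method of the chapter itself. The `ReadPair` fields, the three-standard-deviation cutoff, and the signature labels are assumptions chosen for illustration, and everted pairs (often a sign of tandem duplication) are deliberately not handled.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int      # leftmost mapped coordinate of the first read
    strand1: str   # "+" or "-"
    chrom2: str
    pos2: int      # leftmost mapped coordinate of the second read
    strand2: str


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by the SV signature it suggests, if any.

    mean_insert and sd_insert describe the library's expected mapped span;
    n_sd is a hypothetical cutoff in standard deviations.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if pair.strand1 == pair.strand2:
        # Properly paired reads map to opposite strands.
        return "inversion"
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"    # reads map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"   # reads map closer together than expected
    return "concordant"


# A pair spanning 2,000 bp in a library with a 400 bp mean and 40 bp SD.
label = classify_pair(ReadPair("chr1", 1000, "+", "chr1", 3000, "-"), 400, 40)
```

In practice a single discordant pair is weak evidence, so callers typically require several pairs with the same signature clustered at one locus before reporting a breakpoint.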
Affiliation(s)
- Ira M Hall
- Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA, USA.
6
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011; 21:2224-41. [PMID: 21926179 DOI: 10.1101/gr.126599.111] [Citation(s) in RCA: 318] [Impact Index Per Article: 24.5] [Indexed: 12/21/2022]
Abstract
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
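Contiguity, one of the assessment axes listed in this abstract, is commonly summarized with the N50 statistic. The abstract does not say which contiguity measures the competition used, so the following is only a sketch of the standard N50 definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The function name `n50` is an illustrative choice.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Two hypothetical assemblies with the same total length (100).
fragmented = n50([10] * 10)
contiguous = n50([60, 30, 10])
```

Comparing the two values shows why N50 is used as a contiguity summary: both assemblies contain the same number of bases, but the second concentrates them in longer contigs.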
Affiliation(s)
- Dent Earl
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
7
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011. [PMID: 21926179 DOI: 10.1101/gr.126599] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/19/2022]
8
Kim JH, Kim WC, Li LM, Park S. HapEdit: an accuracy assessment viewer for haplotype assembly using massively parallel DNA-sequencing technologies. Nucleic Acids Res 2011; 39:W557-61. [PMID: 21576217 PMCID: PMC3125762 DOI: 10.1093/nar/gkr354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
The massively parallel sequencing technologies have recently flourished and dramatically cut the cost to sequence personal human genomes. Haplotype assembly from personal genomes sequenced using the massively parallel sequencing technologies is becoming a cost-effective and promising tool for human disease study. Computational assembly of haplotypes has been proved to be very accurate, but obviously contains errors. Here we present a tool, HapEdit, to assess the accuracy of assembled haplotypes and edit them manually. Using this tool, a user can break erroneous haplotype segments into smaller segments, or concatenate haplotype segments if the concatenated haplotype segments are sufficiently supported. A user can also edit bases with low-quality scores. HapEdit displays haplotype assemblies so that a user can easily navigate and pinpoint a region of interest. As inputs, HapEdit currently takes reads from the Polonator, Illumina, SOLiD, 454 and Sanger sequencing technologies.
Affiliation(s)
- Jong Hyun Kim
- Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
9
Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler EE, Sahinalp SC. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 2010; 26:i350-7. [PMID: 20529927 PMCID: PMC2881400 DOI: 10.1093/bioinformatics/btq216] [Citation(s) in RCA: 174] [Impact Index Per Article: 12.4] [Indexed: 11/22/2022]
Abstract
Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state-of-the-art algorithms (such as the original VariationHunter, BreakDancer and MoDIL) when tested on the genome of the Yoruba African individual (NA18507). Availability: The implementation of the algorithm is available at http://compbio.cs.sfu.ca/strvar.htm. Contact: eee@gs.washington.edu; cenk@cs.sfu.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
10
Zhang C, Xing D. Single-Molecule DNA Amplification and Analysis Using Microfluidics. Chem Rev 2010; 110:4910-47. [DOI: 10.1021/cr900081z] [Citation(s) in RCA: 115] [Impact Index Per Article: 8.2] [Indexed: 02/07/2023]
Affiliation(s)
- Chunsun Zhang
- MOE Key Laboratory of Laser Life Science & Institute of Laser Life Science, College of Biophotonics, South China Normal University, Guangzhou 510631, China
- Da Xing
- MOE Key Laboratory of Laser Life Science & Institute of Laser Life Science, College of Biophotonics, South China Normal University, Guangzhou 510631, China
11
Snyder M, Du J, Gerstein M. Personal genome sequencing: current approaches and challenges. Genes Dev 2010; 24:423-31. [PMID: 20194435 DOI: 10.1101/gad.1864110] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Indexed: 11/25/2022]
Abstract
The revolution in DNA sequencing technologies has now made it feasible to determine the genome sequences of many individuals; i.e., "personal genomes." Genome sequences of cells and tissues from both normal and disease states have been determined. Using current approaches, whole human genome sequences are not typically assembled and determined de novo, but, instead, variations relative to a reference sequence are identified. We discuss the current state of personal genome sequencing, the main steps involved in determining a genome sequence (i.e., identifying single-nucleotide polymorphisms [SNPs] and structural variations [SVs], assembling new sequences, and phasing haplotypes), and the challenges and performance metrics for evaluating the accuracy of the reconstruction. Finally, we consider the possible individual and societal benefits of personal genome sequences.
Affiliation(s)
- Michael Snyder
- Department of Genetics, Stanford University School of Medicine, California 94305, USA.
12
Mir KU. Sequencing genomes: from individuals to populations. Briefings in Functional Genomics and Proteomics 2010; 8:367-78. [PMID: 19808932 DOI: 10.1093/bfgp/elp040] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
The whole genome sequences of Jim Watson and Craig Venter are early examples of personalized genomics, which promises to change how we approach healthcare in the future. Before personal sequencing can have practical medical benefits, however, and before it should be advocated for implementation at the population-scale, there needs to be a better understanding of which genetic variants influence which traits and how their effects are modified by epigenetic factors. Nonetheless, for forging links between DNA sequence and phenotype, efforts to sequence the genomes of individuals need to continue; this includes sequencing sub-populations for association studies which analyse the difference in sequence between disease affected and unaffected individuals. Such studies can only be applied on a large enough scale to be effective if the massive strides in sequencing technology that have recently occurred also continue.
Affiliation(s)
- Kalim U Mir
- The Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.