1
|
Li Y, Lin Y. DCHap: A Divide-and-Conquer Haplotype Phasing Algorithm for Third-Generation Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1277-1284. [PMID: 32750878 DOI: 10.1109/tcbb.2020.3005673] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
The development of DNA sequencing technologies makes it possible to obtain reads originated from both copies of a chromosome (two parental chromosomes, or haplotypes) of a single individual. Reconstruction of both haplotypes (i.e., haplotype phasing) plays a crucial role in genetic analysis and provides relationship information between genetic variation and disease susceptibility. With the emerging third-generation sequencing technologies, most existing approaches for haplotype phasing suffer from performance issues to handle long and error-prone reads. We develop a divide-and-conquer algorithm, DCHap, to phase haplotypes using third-generation reads. We benchmark DCHap against three state-of-the-art phasing tools on both PacBio SMRT data and ONT Nanopore data. The experimental results show that DCHap generates more accurate or comparable results (measured by the switch errors) while being scalable for higher coverage and longer reads. DCHap is a fast and accurate algorithm for haplotype phasing using third-generation sequencing data. As the third-generation sequencing platforms continue improving on their throughput and read lengths, accurate and scalable tools like DCHap are important to improve haplotype phasing from the advances of sequencing technologies. The source code is freely available at https://github.com/yanboANU/Haplotype-phasing.
Collapse
|
2
|
Markowski J, Kempfer R, Kukalev A, Irastorza-Azcarate I, Loof G, Kehr B, Pombo A, Rahmann S, Schwarz RF. GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data. Bioinformatics 2021; 37:3128-3135. [PMID: 33830196 PMCID: PMC8504635 DOI: 10.1093/bioinformatics/btab238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2020] [Revised: 02/22/2021] [Accepted: 04/07/2021] [Indexed: 11/24/2022] Open
Abstract
MOTIVATION Genome Architecture Mapping (GAM) was recently introduced as a digestion- and ligation-free method to detect chromatin conformation. Orthogonal to existing approaches based on chromatin conformation capture (3C), GAM's ability to capture both inter- and intra-chromosomal contacts from low amounts of input data makes it particularly well suited for allele-specific analyses in a clinical setting. Allele-specific analyses are powerful tools to investigate the effects of genetic variants on many cellular phenotypes including chromatin conformation, but require the haplotypes of the individuals under study to be known a priori. So far, however, no algorithm exists for haplotype reconstruction and phasing of genetic variants from GAM data, hindering the allele-specific analysis of chromatin contact points in non-model organisms or individuals with unknown haplotypes. RESULTS We present GAMIBHEAR, a tool for accurate haplotype reconstruction from GAM data. GAMIBHEAR aggregates allelic co-observation frequencies from GAM data and employs a GAM-specific probabilistic model of haplotype capture to optimize phasing accuracy. Using a hybrid mouse embryonic stem cell line with known haplotype structure as a benchmark dataset, we assess correctness and completeness of the reconstructed haplotypes, and demonstrate the power of GAMIBHEAR to infer accurate genome-wide haplotypes from GAM data. AVAILABILITY AND IMPLEMENTATION GAMIBHEAR is available as an R package under the open-source GPL-2 license at https://bitbucket.org/schwarzlab/gamibhear. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Julia Markowski
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
- Humboldt-Universität zu Berlin, Department of Biology, 10099 Berlin, Germany
| | - Rieke Kempfer
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
- Humboldt-Universität zu Berlin, Department of Biology, 10099 Berlin, Germany
| | - Alexander Kukalev
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
| | - Ibai Irastorza-Azcarate
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
| | - Gesa Loof
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
- Humboldt-Universität zu Berlin, Department of Biology, 10099 Berlin, Germany
| | - Birte Kehr
- Berlin Institute of Health (BIH) at Charité–Universitätsmedizin Berlin, 10177 Berlin, Germany
- Regensburg Center for Interventional Immunology (RCI), 93053 Regensburg, Germany
- Universität Regensburg, 93053 Regensburg, Germany
| | - Ana Pombo
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
- Humboldt-Universität zu Berlin, Department of Biology, 10099 Berlin, Germany
| | - Sven Rahmann
- Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, 45122 Essen, Germany
| | - Roland F Schwarz
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 10115 Berlin, Germany
- Berlin Institute for the Foundations of Learning and Data (BIFOLD), 10623 Berlin, Germany
| |
Collapse
|
3
|
Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2019; 19:329-346. [PMID: 29599501 DOI: 10.1038/s41576-018-0003-4] [Citation(s) in RCA: 291] [Impact Index Per Article: 58.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
Abstract
Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.
Collapse
Affiliation(s)
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Hayan Lee
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. .,Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| |
Collapse
|
4
|
Beretta S, Patterson MD, Zaccaria S, Della Vedova G, Bonizzoni P. HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads. BMC Bioinformatics 2018; 19:252. [PMID: 29970002 PMCID: PMC6029272 DOI: 10.1186/s12859-018-2253-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2017] [Accepted: 06/18/2018] [Indexed: 01/08/2023] Open
Abstract
Background Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Results Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, allowing to consider datasets composed of higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60 × coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Conclusions Our method leverages the long-range information of sequencing reads that allows to obtain assembled haplotypes fragmented in a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. Availability HapCHAT is available at http://hapchat.algolab.euunder the GNU Public License (GPL).
Collapse
Affiliation(s)
- Stefano Beretta
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| | - Murray D Patterson
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy.
| | - Simone Zaccaria
- Department of Computer Science, Princeton University, Princeton, New Jersey, USA
| | - Gianluca Della Vedova
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| |
Collapse
|
5
|
Edge P, Bafna V, Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res 2016; 27:801-812. [PMID: 27940952 PMCID: PMC5411775 DOI: 10.1101/gr.213462.116] [Citation(s) in RCA: 199] [Impact Index Per Article: 24.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2016] [Accepted: 12/08/2016] [Indexed: 11/24/2022]
Abstract
Many tools have been developed for haplotype assembly—the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types—dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing—we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.
Collapse
Affiliation(s)
- Peter Edge
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA
| | - Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California 92053, USA
| |
Collapse
|
6
|
Abstract
Human genomes are diploid and, for their complete description and interpretation, it is necessary not only to discover the variation they contain but also to arrange it onto chromosomal haplotypes. Although whole-genome sequencing is becoming increasingly routine, nearly all such individual genomes are mostly unresolved with respect to haplotype, particularly for rare alleles, which remain poorly resolved by inferential methods. Here, we review emerging technologies for experimentally resolving (that is, 'phasing') haplotypes across individual whole-genome sequences. We also discuss computational methods relevant to their implementation, metrics for assessing their accuracy and completeness, and the relevance of haplotype information to applications of genome sequencing in research and clinical medicine.
Collapse
|
7
|
Lo C, Liu R, Lee J, Robasky K, Byrne S, Lucchesi C, Aach J, Church G, Bafna V, Zhang K. On the design of clone-based haplotyping. Genome Biol 2015; 14:R100. [PMID: 24028704 PMCID: PMC4053695 DOI: 10.1186/gb-2013-14-9-r100] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2013] [Accepted: 09/04/2013] [Indexed: 11/16/2022] Open
Abstract
Background Haplotypes are important for assessing genealogy and disease susceptibility of individual genomes, but are difficult to obtain with routine sequencing approaches. Experimental haplotype reconstruction based on assembling fragments of individual chromosomes is promising, but with variable yields due to incompletely understood parameter choices. Results We parameterize the clone-based haplotyping problem in order to provide theoretical and empirical assessments of the impact of different parameters on haplotype assembly. We confirm the intuition that long clones help link together heterozygous variants and thus improve haplotype length. Furthermore, given the length of the clones, we address how to choose the other parameters, including number of pools, clone coverage and sequencing coverage, so as to maximize haplotype length. We model the problem theoretically and show empirically the benefits of using larger clones with moderate number of pools and sequencing coverage. In particular, using 140 kb BAC clones, we construct haplotypes for a personal genome and assemble haplotypes with N50 values greater than 2.6 Mb. These assembled haplotypes are longer and at least as accurate as haplotypes of existing clone-based strategies, whether in vivo or in vitro. Conclusions Our results provide practical guidelines for the development and design of clone-based methods to achieve long range, high-resolution and accurate haplotypes.
Collapse
|
8
|
Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 2014; 46:1343-9. [PMID: 25326703 DOI: 10.1038/ng.3119] [Citation(s) in RCA: 122] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2014] [Accepted: 09/24/2014] [Indexed: 12/11/2022]
Abstract
Haplotype-resolved genome sequencing enables the accurate interpretation of medically relevant genetic variation, deep inferences regarding population history and non-invasive prediction of fetal genomes. We describe an approach for genome-wide haplotyping based on contiguity-preserving transposition (CPT-seq) and combinatorial indexing. Tn5 transposition is used to modify DNA with adaptor and index sequences while preserving contiguity. After DNA dilution and compartmentalization, the transposase is removed, resolving the DNA into individually indexed libraries. The libraries in each compartment, enriched for neighboring genomic elements, are further indexed via PCR. Combinatorial 96-plex indexing at both the transposition and PCR stage enables the construction of phased synthetic reads from each of the nearly 10,000 'virtual compartments'. We demonstrate the feasibility of this method by assembling >95% of the heterozygous variants in a human genome into long, accurate haplotype blocks (N50 = 1.4-2.3 Mb). The rapid, scalable and cost-effective workflow could enable haplotype resolution to become routine in human genome sequencing.
Collapse
|
9
|
Abstract
Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.
Collapse
Affiliation(s)
- Gustavo Glusman
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| | - Hannah C Cox
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| | - Jared C Roach
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| |
Collapse
|
10
|
Abstract
MOTIVATION Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders like Huntington's diseases. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variation calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may potentially resolve a broader spectrum of TRs given their increased length, but require new approaches given their significantly higher raw error profiles. However, due to inherent error profiles of the single-molecule technologies, these reads presents a unique challenge in terms of accurately identifying and estimating the TRs. RESULTS Here we present PacmonSTR, a reference-based probabilistic approach, to identify the TR region and estimate the number of these TR elements in long DNA reads. We present a multistep approach that requires as input, a reference region and the reference TR element. Initially, the TR region is identified from the long DNA reads via a 3-stage modified Smith-Waterman approach and then, expected number of TR elements is calculated using a pair-Hidden Markov Models-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criteria, and coverage expectations.
Collapse
Affiliation(s)
- Ajay Ummat
- Department of Genetics and Genomic Science and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ali Bashir
- Department of Genetics and Genomic Science and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
Collapse
|
11
|
Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G. High throughput sequencing methods and analysis for microbiome research. J Microbiol Methods 2013; 95:401-14. [PMID: 24029734 DOI: 10.1016/j.mimet.2013.08.011] [Citation(s) in RCA: 147] [Impact Index Per Article: 13.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2013] [Revised: 08/13/2013] [Accepted: 08/13/2013] [Indexed: 02/07/2023]
Abstract
High-throughput sequencing technology is rapidly improving in quality, speed and cost. It is therefore becoming more widely used to study whole communities of prokaryotes in many niches. This review discusses these techniques, including nucleic acid extraction from different environments, sample preparation and high-throughput sequencing platforms. We also discuss commonly used and recently developed bioinformatic tools applied to microbiomes, including analyzing amplicon sequences, metagenome shotgun sequences and metatranscriptome sequences. This field is relatively new and rapidly evolving, thus we hope that this review will provide a baseline for understanding these methods of microbiome analyses. Additionally, we seek to stimulate others to solve the many problems that still exist with the sensitivity, specificity and interpretation of high throughput microbiome sequence analysis.
Collapse
Affiliation(s)
- Julia M Di Bella
- Department of Microbiology and Immunology, The University of Western Ontario, London, ON, Canada
| | | | | | | | | |
Collapse
|
12
|
Krasileva KV, Buffalo V, Bailey P, Pearce S, Ayling S, Tabbita F, Soria M, Wang S, Akhunov E, Uauy C, Dubcovsky J. Separating homeologs by phasing in the tetraploid wheat transcriptome. Genome Biol 2013; 14:R66. [PMID: 23800085 PMCID: PMC4053977 DOI: 10.1186/gb-2013-14-6-r66] [Citation(s) in RCA: 115] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2013] [Accepted: 06/25/2013] [Indexed: 11/10/2022] Open
Abstract
Background The high level of identity among duplicated homoeologous genomes in tetraploid pasta wheat presents substantial challenges for de novo transcriptome assembly. To solve this problem, we develop a specialized bioinformatics workflow that optimizes transcriptome assembly and separation of merged homoeologs. To evaluate our strategy, we sequence and assemble the transcriptome of one of the diploid ancestors of pasta wheat, and compare both assemblies with a benchmark set of 13,472 full-length, non-redundant bread wheat cDNAs. Results A total of 489 million 100 bp paired-end reads from tetraploid wheat assemble in 140,118 contigs, including 96% of the benchmark cDNAs. We used a comparative genomics approach to annotate 66,633 open reading frames. The multiple k-mer assembly strategy increases the proportion of cDNAs assembled full-length in a single contig by 22% relative to the best single k-mer size. Homoeologs are separated using a post-assembly pipeline that includes polymorphism identification, phasing of SNPs, read sorting, and re-assembly of phased reads. Using a reference set of genes, we determine that 98.7% of SNPs analyzed are correctly separated by phasing. Conclusions Our study shows that de novo transcriptome assembly of tetraploid wheat benefit from multiple k-mer assembly strategies more than diploid wheat. Our results also demonstrate that phasing approaches originally designed for heterozygous diploid organisms can be used to separate the close homoeologous genomes of tetraploid wheat. The predicted tetraploid wheat proteome and gene models provide a valuable tool for the wheat research community and for those interested in comparative genomic studies.
Collapse
|
13
|
Xie M, Wang J, Jiang T. A fast and accurate algorithm for single individual haplotyping. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 2:S8. [PMID: 23282221 PMCID: PMC3521186 DOI: 10.1186/1752-0509-6-s2-s8] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
Background Due to the difficulty in separating two (paternal and maternal) copies of a chromosome, most published human genome sequences only provide genotype information, i.e., the mixed information of the underlying two haplotypes. However, phased haplotype information is needed to completely understand complex genetic polymorphisms and to increase the power of genome-wide association studies for complex diseases. With the rapid development of DNA sequencing technologies, reconstructing a pair of haplotypes from an individual's aligned DNA fragments by computer algorithms (i.e., Single Individual Haplotyping) has become a practical haplotyping approach. Results In the paper, we combine two measures "errors corrected" and "fragments cut" and propose a new optimization model, called Balanced Optimal Partition (BOP), for single individual haplotyping. The model generalizes two existing models, Minimum Error Correction (MEC) and Maximum Fragments Cut (MFC), and could be made either model by using some extreme parameter values. To solve the model, we design a heuristic dynamic programming algorithm H-BOP. By limiting the number of intermediate solutions at each iteration to an appropriately chosen small integer k, H-BOP is able to solve the model efficiently. Conclusions Extensive experimental results on simulated and real data show that when k = 8, H-BOP is generally faster and more accurate than a recent state-of-art algorithm ReFHap in haplotype reconstruction. The running time of H-BOP is linearly dependent on some of the key parameters controlling the input size and H-BOP scales well to large input data. The code of H-BOP is available to the public for free upon request to the corresponding author.
Collapse
Affiliation(s)
- Minzhu Xie
- College of Physics and Information Science, Hunan Normal University, Changsha 410081, PR China.
| | | | | |
Collapse
|
14
|
Luciani F, Bull RA, Lloyd AR. Next generation deep sequencing and vaccine design: today and tomorrow. Trends Biotechnol 2012; 30:443-52. [PMID: 22721705 PMCID: PMC7127335 DOI: 10.1016/j.tibtech.2012.05.005] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2012] [Revised: 05/21/2012] [Accepted: 05/21/2012] [Indexed: 12/20/2022]
Abstract
Next generation sequencing (NGS) technologies have redefined the modus operandi in both human and microbial genetics research, allowing the unprecedented generation of very large sequencing datasets on a short time scale and at affordable costs. Vaccine development research is rapidly taking full advantage of the advent of NGS. This review provides a concise summary of the current applications of NGS in relation to research seeking to develop vaccines for human infectious diseases, incorporating studies of both the pathogen and the host. We focus on rapidly mutating viral pathogens, which are major targets in current vaccine research. NGS is unraveling the complex dynamics of viral evolution and host responses against these viruses, thus contributing substantially to the likelihood of successful vaccine development.
Collapse
Affiliation(s)
- Fabio Luciani
- Inflammation and Infection Research Centre, School of Medical Sciences, University of New South Wales, Sydney, Australia.
| | | | | |
Collapse
|
15
|
Stranneheim H, Lundeberg J. Stepping stones in DNA sequencing. Biotechnol J 2012; 7:1063-73. [PMID: 22887891 PMCID: PMC3472021 DOI: 10.1002/biot.201200153] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2012] [Revised: 06/21/2012] [Accepted: 07/04/2012] [Indexed: 12/11/2022]
Abstract
In recent years there have been tremendous advances in our ability to rapidly and cost-effectively sequence DNA. This has revolutionized the fields of genetics and biology, leading to a deeper understanding of the molecular events in life processes. The rapid technological advances have enormously expanded sequencing opportunities and applications, but also imposed strains and challenges on steps prior to sequencing and in the downstream process of handling and analysis of these massive amounts of sequence data. Traditionally, sequencing has been limited to small DNA fragments of approximately one thousand bases (derived from the organism's genome) due to issues in maintaining a high sequence quality and accuracy for longer read lengths. Although many technological breakthroughs have been made, currently the commercially available massively parallel sequencing methods have not been able to resolve this issue. However, recent announcements in nanopore sequencing hold the promise of removing this read-length limitation, enabling sequencing of larger intact DNA fragments. The ability to sequence longer intact DNA with high accuracy is a major stepping stone towards greatly simplifying the downstream analysis and increasing the power of sequencing compared to today. This review covers some of the technical advances in sequencing that have opened up new frontiers in genomics.
Collapse
Affiliation(s)
- Henrik Stranneheim
- Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden.
| | | |
Collapse
|
16
|
Duitama J, McEwen GK, Huebsch T, Palczewski S, Schulz S, Verstrepen K, Suk EK, Hoehe MR. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic Acids Res 2011; 40:2041-53. [PMID: 22102577 PMCID: PMC3299995 DOI: 10.1093/nar/gkr1042] [Citation(s) in RCA: 94] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.
Collapse
Affiliation(s)
- Jorge Duitama
- Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany.
| | | | | | | | | | | | | | | |
Collapse
|