51
|
Delaneau O, Zagury JF, Robinson MR, Marchini JL, Dermitzakis ET. Accurate, scalable and integrative haplotype estimation. Nat Commun 2019; 10:5436. [PMID: 31780650 PMCID: PMC6882857 DOI: 10.1038/s41467-019-13225-y] [Citation(s) in RCA: 243] [Impact Index Per Article: 48.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2019] [Accepted: 10/10/2019] [Indexed: 01/28/2023] Open
Abstract
The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.
Collapse
Affiliation(s)
- Olivier Delaneau
- Department of Computational Biology, University of Lausanne, Génopode, 1015, Lausanne, Switzerland. .,Swiss Institute of Bioinformatics (SIB), University of Lausanne, Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland.
| | - Jean-François Zagury
- Chaire de Bioinformatique, Laboratoire GBCM (EA7528), Conservatoire National des Arts et Métiers, HESAM Université, Paris, France
| | - Matthew R Robinson
- Department of Computational Biology, University of Lausanne, Génopode, 1015, Lausanne, Switzerland.,Swiss Institute of Bioinformatics (SIB), University of Lausanne, Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Jonathan L Marchini
- Department of Statistics, University of Oxford, 24-29 St. Giles, Oxford, OX1 3LB, UK
| | - Emmanouil T Dermitzakis
- Department of Genetic Medicine and Development, University of Geneva Medical School, 1 rue Michel-Servet, 1211, Geneva, Switzerland.,Swiss Institute of Bioinformatics (SIB), University of Geneva, 1 rue Michel-Servet, 1211, Geneva, Switzerland.,Institute of Genetics and Genomics in Geneva, University of Geneva Medical School, 1 rue Michel-Servet, 1211, Geneva, Switzerland
| |
Collapse
|
52
|
Bansal V. Integrating read-based and population-based phasing for dense and accurate haplotyping of individual genomes. Bioinformatics 2019; 35:i242-i248. [PMID: 31510646 PMCID: PMC6612846 DOI: 10.1093/bioinformatics/btz329] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Reconstruction of haplotypes for human genomes is an important problem in medical and population genetics. Hi-C sequencing generates read pairs with long-range haplotype information that can be computationally assembled to generate chromosome-spanning haplotypes. However, the haplotypes have limited completeness and low accuracy. Haplotype information from population reference panels can potentially be used to improve the completeness and accuracy of Hi-C haplotyping. RESULTS In this paper, we describe a likelihood based method to integrate short-range haplotype information from a population reference panel of haplotypes with the long-range haplotype information present in sequence reads from methods such as Hi-C to assemble dense and highly accurate haplotypes for individual genomes. Our method leverages a statistical phasing method and a maximum spanning tree algorithm to determine the optimal second-order approximation of the population-based haplotype likelihood for an individual genome. The population-based likelihood is encoded using pseudo-reads which are then used as input along with sequence reads for haplotype assembly using an existing tool, HapCUT2. Using whole-genome Hi-C data for two human genomes (NA19240 and NA12878), we demonstrate that this integrated phasing method enables the phasing of 97-98% of variants, reduces the switch error rates by 3-6-fold, and outperforms an existing method for combining phase information from sequence reads with population-based phasing. On Strand-seq data for NA12878, our method improves the haplotype completeness from 71.4 to 94.6% and reduces the switch error rate 2-fold, demonstrating its utility for phasing using multiple sequencing technologies. AVAILABILITY AND IMPLEMENTATION Code and datasets are available at https://github.com/vibansal/IntegratedPhasing.
Collapse
Affiliation(s)
- Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA
| |
Collapse
|
53
|
Somerville V, Lutz S, Schmid M, Frei D, Moser A, Irmler S, Frey JE, Ahrens CH. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol 2019; 19:143. [PMID: 31238873 PMCID: PMC6593500 DOI: 10.1186/s12866-019-1500-0] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2018] [Accepted: 05/31/2019] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND Complete and contiguous genome assemblies greatly improve the quality of subsequent systems-wide functional profiling studies and the ability to gain novel biological insights. While a de novo genome assembly of an isolated bacterial strain is in most cases straightforward, more informative data about co-existing bacteria as well as synergistic and antagonistic effects can be obtained from a direct analysis of microbial communities. However, the complexity of metagenomic samples represents a major challenge. While third generation sequencing technologies have been suggested to enable finished metagenome-assembled genomes, to our knowledge, the complete genome assembly of all dominant strains in a microbiome sample has not been demonstrated. Natural whey starter cultures (NWCs) are used in cheese production and represent low-complexity microbiomes. Previous studies of Swiss Gruyère and selected Italian hard cheeses, mostly based on amplicon metagenomics, concurred that three species generally pre-dominate: Streptococcus thermophilus, Lactobacillus helveticus and Lactobacillus delbrueckii. RESULTS Two NWCs from Swiss Gruyère producers were subjected to whole metagenome shotgun sequencing using the Pacific Biosciences Sequel and Illumina MiSeq platforms. In addition, longer Oxford Nanopore Technologies MinION reads had to be generated for one to resolve repeat regions. Thereby, we achieved the complete assembly of all dominant bacterial genomes from these low-complexity NWCs, which was corroborated by a 16S rRNA amplicon survey. Moreover, two distinct L. helveticus strains were successfully co-assembled from the same sample. Besides bacterial chromosomes, we could also assemble several bacterial plasmids and phages and a corresponding prophage. Biologically relevant insights were uncovered by linking the plasmids and phages to their respective host genomes using DNA methylation motifs on the plasmids and by matching prokaryotic CRISPR spacers with the corresponding protospacers on the phages. These results could only be achieved by employing long-read sequencing data able to span intragenomic as well as intergenomic repeats. CONCLUSIONS Here, we demonstrate the feasibility of complete de novo genome assembly of all dominant strains from low-complexity NWCs based on whole metagenomics shotgun sequencing data. This allowed to gain novel biological insights and is a fundamental basis for subsequent systems-wide omics analyses, functional profiling and phenotype to genotype analysis of specific microbial communities.
Collapse
Affiliation(s)
- Vincent Somerville
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
| | - Stefanie Lutz
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
| | - Michael Schmid
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
| | - Daniel Frei
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
| | - Aline Moser
- Agroscope, Research Group Biochemistry of Milk and Microorganisms, CH-3003 Bern, Switzerland
| | - Stefan Irmler
- Agroscope, Research Group Biochemistry of Milk and Microorganisms, CH-3003 Bern, Switzerland
| | - Jürg E. Frey
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
| | - Christian H. Ahrens
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
| |
Collapse
|
54
|
Ebler J, Haukness M, Pesout T, Marschall T, Paten B. Haplotype-aware diplotyping from noisy long reads. Genome Biol 2019; 20:116. [PMID: 31159868 PMCID: PMC6547545 DOI: 10.1186/s13059-019-1709-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2018] [Accepted: 05/06/2019] [Indexed: 12/19/2022] Open
Abstract
Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.
Collapse
Affiliation(s)
- Jana Ebler
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA.
| |
Collapse
|
55
|
Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun 2019. [PMID: 30992455 DOI: 10.1038/s41467‐018‐08148‐z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
Collapse
|
56
|
Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun 2019; 10:1784. [PMID: 30992455 PMCID: PMC6467913 DOI: 10.1038/s41467-018-08148-z] [Citation(s) in RCA: 478] [Impact Index Per Article: 95.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2018] [Accepted: 12/20/2018] [Indexed: 12/30/2022] Open
Abstract
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
Collapse
|
57
|
Tilleman L, Weymaere J, Heindryckx B, Deforce D, Nieuwerburgh FV. Contemporary pharmacogenetic assays in view of the PharmGKB database. Pharmacogenomics 2019; 20:261-272. [PMID: 30883266 DOI: 10.2217/pgs-2018-0167] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
AIM Six modern PGx assays were compared with the Pharmacogenomics Knowledge Base (PharmGKB) to determine the proportion of the currently known PGx genotypes that are assessed by these assays. MATERIALS & METHODS Investigated assays were 'Ion AmpliSeq Pharmacogenomics', 'iPLEX PGx Pro', 'DMET Plus,' 'PharmcoScan,' 'Living DNA' and '23andMe.' RESULTS PharmGKB contains 3474 clinical annotations of which 75, 70 and 45% can be determined by PharmacoScan, Living DNA and 23andMe, respectively. The other assays are designed to test a specific subset of PGx variants. CONCLUSION Assaying all known PGx variants would only comprise a minor fraction of the current assays' capacity. Unfortunately, this is not achieved. Moreover, not necessarily the variants with the highest effects or the highest evidence are selected.
Collapse
Affiliation(s)
- Laurentijn Tilleman
- Laboratory of Pharmaceutical Biotechnology, Ghent University, Ottergemsesteenweg 460, 9000 Ghent, Belgium
| | - Jana Weymaere
- Laboratory of Pharmaceutical Biotechnology, Ghent University, Ottergemsesteenweg 460, 9000 Ghent, Belgium
| | - Björn Heindryckx
- Ghent-Fertility & Stem Cell Team (G-FaST), Department for Reproductive Medicine, Ghent University Hospital, Corneel Heymanslaan 10, 9000 Ghent, Belgium
| | - Dieter Deforce
- Laboratory of Pharmaceutical Biotechnology, Ghent University, Ottergemsesteenweg 460, 9000 Ghent, Belgium
| | - Filip Van Nieuwerburgh
- Laboratory of Pharmaceutical Biotechnology, Ghent University, Ottergemsesteenweg 460, 9000 Ghent, Belgium
| |
Collapse
|
58
|
Lin YY, Wu PC, Chen PL, Oyang YJ, Chen CY. HAHap: a read-based haplotyping method using hierarchical assembly. PeerJ 2018; 6:e5852. [PMID: 30397550 PMCID: PMC6214236 DOI: 10.7717/peerj.5852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2018] [Accepted: 09/27/2018] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary trend to resolve haplotypes by reducing conflicts in a single nucleotide polymorphism-fragment matrix. However, it is frequently observed that the solution with the optimal MEC might not be the real haplotypes, due to the fact that MEC methods consider all positions together and sometimes the conflicts in noisy regions might mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. RESULTS This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confident variant pairs to build haplotypes progressively. The phasing results by HAHap on both real and simulated data, compared to other MEC-based methods, revealed better phasing error rates for constructing haplotypes using short reads from whole-genome sequencing. We compared the number of error corrections (ECs) on real data with other methods, and it reveals the ability of HAHap to predict haplotypes with a lower number of ECs. We also used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting the applicability of HAHap in certain situations.
Collapse
Affiliation(s)
- Yu-Yu Lin
- Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
| | - Ping Chun Wu
- Taipei Blood Center, Taiwan Blood Services Foundation, Taipei, Taiwan
| | - Pei-Lung Chen
- Graduate Institute of Medical Genomics and Proteomics, College of Medicine, National Taiwan University, Taipei, Taiwan
| | - Yen-Jen Oyang
- Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
| | - Chien-Yu Chen
- Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan
| |
Collapse
|
59
|
Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 2018; 36:nbt.4277. [PMID: 30346939 PMCID: PMC6476705 DOI: 10.1038/nbt.4277] [Citation(s) in RCA: 248] [Impact Index Per Article: 41.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2018] [Accepted: 09/10/2018] [Indexed: 12/20/2022]
Abstract
Complex allelic variation hampers the assembly of haplotype-resolved sequences from diploid genomes. We developed trio binning, an approach that simplifies haplotype assembly by resolving allelic variation before assembly. In contrast with prior approaches, the effectiveness of our method improved with increasing heterozygosity. Trio binning uses short reads from two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. We used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches. We sequenced an F1 cross between the cattle subspecies Bos taurus taurus and Bos taurus indicus and completely assembled both parental haplotypes with NG50 haplotig sizes of >20 Mb and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We suggest that trio binning improves diploid genome assembly and will facilitate new studies of haplotype variation and inheritance.
Collapse
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Brian P. Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Alexander T. Dilthey
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
- Institute of Medical Microbiology, Heinrich-Heine-University Düsseldorf, Düsseldorf, North Rhine-Westphalia, Germany
| | - Derek M. Bickhart
- Cell Wall Biology and Utilization Laboratory, ARS USDA, Madison, Wisconsin, USA
| | | | - Stefan Hiendleder
- Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia
- Robinson Research Institute, The University of Adelaide, Adelaide SA, Australia
| | - John L. Williams
- Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia
| | | | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| |
Collapse
|
60
|
Suzuki Y, Wang Y, Au KF, Morishita S. A Statistical Method for Observing Personal Diploid Methylomes and Transcriptomes with Single-Molecule Real-Time Sequencing. Genes (Basel) 2018; 9:E460. [PMID: 30235838 PMCID: PMC6162384 DOI: 10.3390/genes9090460] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2018] [Revised: 09/12/2018] [Accepted: 09/12/2018] [Indexed: 11/16/2022] Open
Abstract
We address the problem of observing personal diploid methylomes, CpG methylome pairs of homologous chromosomes that are distinguishable with respect to phased heterozygous variants (PHVs), which is challenging due to scarcity of PHVs in personal genomes. Single molecule real-time (SMRT) sequencing is promising as it outputs long reads with CpG methylation information, but a serious concern is whether reliable PHVs are available in erroneous SMRT reads with an error rate of ∼15%. To overcome the issue, we propose a statistical model that reduces the error rate of phasing CpG site to 1%, thereby calling CpG hypomethylation in each haplotype with >90% precision and sensitivity. Using our statistical model, we examined GNAS complex locus known for a combination of maternally, paternally, or biallelically expressed isoforms, and observed allele-specific methylation pattern almost perfectly reflecting their respective allele-specific expression status, demonstrating the merit of elucidating comprehensive personal diploid methylomes and transcriptomes.
Collapse
Affiliation(s)
- Yuta Suzuki
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo 277-8561, Japan.
| | - Yunhao Wang
- Department of Internal Medicine, University of Iowa, Iowa City, IA 52242, USA.
| | - Kin Fai Au
- Department of Internal Medicine, University of Iowa, Iowa City, IA 52242, USA.
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210, USA.
| | - Shinichi Morishita
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo 277-8561, Japan.
| |
Collapse
|
61
|
Beretta S, Patterson MD, Zaccaria S, Della Vedova G, Bonizzoni P. HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads. BMC Bioinformatics 2018; 19:252. [PMID: 29970002 PMCID: PMC6029272 DOI: 10.1186/s12859-018-2253-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2017] [Accepted: 06/18/2018] [Indexed: 01/08/2023] Open
Abstract
Background Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Results Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, allowing to consider datasets composed of higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60 × coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Conclusions Our method leverages the long-range information of sequencing reads that allows to obtain assembled haplotypes fragmented in a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. Availability HapCHAT is available at http://hapchat.algolab.euunder the GNU Public License (GPL).
Collapse
Affiliation(s)
- Stefano Beretta
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| | - Murray D Patterson
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy.
| | - Simone Zaccaria
- Department of Computer Science, Princeton University, Princeton, New Jersey, USA
| | - Gianluca Della Vedova
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| |
Collapse
|
62
|
Ghareghani M, Porubskỳ D, Sanders AD, Meiers S, Eichler EE, Korbel JO, Marschall T. Strand-seq enables reliable separation of long reads by chromosome via expectation maximization. Bioinformatics 2018; 34:i115-i123. [PMID: 29949971 PMCID: PMC6022540 DOI: 10.1093/bioinformatics/bty290] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
Motivation Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately. Results To address this, we show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrates its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows to assess the amount of uncertainty inherent to sparse Strand-seq data on the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Bioscience reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly. Availability and implementation https://github.com/daewoooo/SaaRclust.
Collapse
Affiliation(s)
- Maryam Ghareghani
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany
| | - David Porubskỳ
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
| | - Ashley D Sanders
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Sascha Meiers
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Jan O Korbel
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
| |
Collapse
|
63
|
Garg S, Rautiainen M, Novak AM, Garrison E, Durbin R, Marschall T. A graph-based approach to diploid genome assembly. Bioinformatics 2018; 34:i105-i114. [PMID: 29949989 PMCID: PMC6022571 DOI: 10.1093/bioinformatics/bty279] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Motivation Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation https://github.com/whatshap/whatshap. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shilpa Garg
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, Germany
- Department of Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
| | - Mikko Rautiainen
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, Germany
- Department of Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
| | - Adam M Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
| | - Richard Durbin
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, Germany
- Department of Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
| |
Collapse
|
64
|
van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C. The Third Revolution in Sequencing Technology. Trends Genet 2018; 34:666-681. [PMID: 29941292 DOI: 10.1016/j.tig.2018.05.008] [Citation(s) in RCA: 549] [Impact Index Per Article: 91.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2018] [Revised: 05/18/2018] [Accepted: 05/29/2018] [Indexed: 12/16/2022]
Abstract
Forty years ago the advent of Sanger sequencing was revolutionary as it allowed complete genome sequences to be deciphered for the first time. A second revolution came when next-generation sequencing (NGS) technologies appeared, which made genome sequencing much cheaper and faster. However, NGS methods have several drawbacks and pitfalls, most notably their short reads. Recently, third-generation/long-read methods appeared, which can produce genome assemblies of unprecedented quality. Moreover, these technologies can directly detect epigenetic modifications on native DNA and allow whole-transcript sequencing without the need for assembly. This marks the third revolution in sequencing technology. Here we review and compare the various long-read methods. We discuss their applications and their respective strengths and weaknesses and provide future perspectives.
Collapse
Affiliation(s)
- Erwin L van Dijk
- Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France.
| | - Yan Jaszczyszyn
- Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
| | - Delphine Naquin
- Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
| | - Claude Thermes
- Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
| |
Collapse
|
65
|
Kyriakidou M, Tai HH, Anglin NL, Ellis D, Strömvik MV. Current Strategies of Polyploid Plant Genome Sequence Assembly. FRONTIERS IN PLANT SCIENCE 2018; 9:1660. [PMID: 30519250 PMCID: PMC6258962 DOI: 10.3389/fpls.2018.01660] [Citation(s) in RCA: 101] [Impact Index Per Article: 16.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/13/2018] [Accepted: 10/25/2018] [Indexed: 05/14/2023]
Abstract
Polyploidy or duplication of an entire genome occurs in the majority of angiosperms. The understanding of polyploid genomes is important for the improvement of those crops, which humans rely on for sustenance and basic nutrition. As climate change continues to pose a potential threat to agricultural production, there will increasingly be a demand for plant cultivars that can resist biotic and abiotic stresses and also provide needed and improved nutrition. In the past decade, Next Generation Sequencing (NGS) has fundamentally changed the genomics landscape by providing tools for the exploration of polyploid genomes. Here, we review the challenges of the assembly of polyploid plant genomes, and also present recent advances in genomic resources and functional tools in molecular genetics and breeding. As genomes of diploid and less heterozygous progenitor species are increasingly available, we discuss the lack of complexity of these currently available reference genomes as they relate to polyploid crops. Finally, we review recent approaches of haplotyping by phasing and the impact of third generation technologies on polyploid plant genome assembly.
Collapse
Affiliation(s)
- Maria Kyriakidou
- Department of Plant Science, McGill University, Montreal, QC, Canada
| | - Helen H. Tai
- Fredericton Research and Development Centre, Agriculture and Agri-Food Canada, Fredericton, NB, Canada
| | | | | | - Martina V. Strömvik
- Department of Plant Science, McGill University, Montreal, QC, Canada
- *Correspondence: Martina V. Strömvik
| |
Collapse
|