1
|
Henglin M, Ghareghani M, Harvey WT, Porubsky D, Koren S, Eichler EE, Ebert P, Marschall T. Graphasing: phasing diploid genome assembly graphs with single-cell strand sequencing. Genome Biol 2024; 25:265. [PMID: 39390579 PMCID: PMC11466045 DOI: 10.1186/s13059-024-03409-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Accepted: 09/30/2024] [Indexed: 10/12/2024] Open
Abstract
Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.
Collapse
Affiliation(s)
- Mir Henglin
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Maryam Ghareghani
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Core Unit Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
| |
Collapse
|
2
|
Zhang T, Zhou J, Gao W, Jia Y, Wei Y, Wang G. Complex genome assembly based on long-read sequencing. Brief Bioinform 2022; 23:6657663. [PMID: 35940845 DOI: 10.1093/bib/bbac305] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2022] [Revised: 06/20/2022] [Accepted: 07/06/2022] [Indexed: 11/12/2022] Open
Abstract
High-quality genome chromosome-scale sequences provide an important basis for genomics downstream analysis, especially the construction of haplotype-resolved and complete genomes, which plays a key role in genome annotation, mutation detection, evolutionary analysis, gene function research, comparative genomics and other aspects. However, genome-wide short-read sequencing is difficult to produce a complete genome in the face of a complex genome with high duplication and multiple heterozygosity. The emergence of long-read sequencing technology has greatly improved the integrity of complex genome assembly. We review a variety of computational methods for complex genome assembly and describe in detail the theories, innovations and shortcomings of collapsed, semi-collapsed and uncollapsed assemblers based on long reads. Among the three methods, uncollapsed assembly is the most correct and complete way to represent genomes. In addition, genome assembly is closely related to haplotype reconstruction, that is uncollapsed assembly realizes haplotype reconstruction, and haplotype reconstruction promotes uncollapsed assembly. We hope that gapless, telomere-to-telomere and accurate assembly of complex genomes can be truly routinely achieved using only a simple process or a single tool in the future.
Collapse
Affiliation(s)
- Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Jie Zhou
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Yuran Jia
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Yanan Wei
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| |
Collapse
|
3
|
Khan K, Ahram DF, Liu YP, Westland R, Sampogna RV, Katsanis N, Davis EE, Sanna-Cherchi S. Multidisciplinary approaches for elucidating genetics and molecular pathogenesis of urinary tract malformations. Kidney Int 2022; 101:473-484. [PMID: 34780871 PMCID: PMC8934530 DOI: 10.1016/j.kint.2021.09.034] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2021] [Revised: 09/15/2021] [Accepted: 09/30/2021] [Indexed: 12/28/2022]
Abstract
Advances in clinical diagnostics and molecular tools have improved our understanding of the genetically heterogeneous causes underlying congenital anomalies of kidney and urinary tract (CAKUT). However, despite a sharp incline of CAKUT reports in the literature within the past 2 decades, there remains a plateau in the genetic diagnostic yield that is disproportionate to the accelerated ability to generate robust genome-wide data. Explanations for this observation include (i) diverse inheritance patterns with incomplete penetrance and variable expressivity, (ii) rarity of single-gene drivers such that large sample sizes are required to meet the burden of proof, and (iii) multigene interactions that might produce either intra- (e.g., copy number variants) or inter- (e.g., effects in trans) locus effects. These challenges present an opportunity for the community to implement innovative genetic and molecular avenues to explain the missing heritability and to better elucidate the mechanisms that underscore CAKUT. Here, we review recent multidisciplinary approaches at the intersection of genetics, genomics, in vivo modeling, and in vitro systems toward refining a blueprint for overcoming the diagnostic hurdles that are pervasive in urinary tract malformation cohorts. These approaches will not only benefit clinical management by reducing age at molecular diagnosis and prompting early evaluation for comorbid features but will also serve as a springboard for therapeutic development.
Collapse
Affiliation(s)
- Kamal Khan
- Center for Human Disease Modeling, Duke University, Durham, North Carolina, USA.,Stanley Manne Children’s Research Institute, Ann & Robert H. Lurie Children’s Hospital of Chicago, Chicago, Illinois, USA (current address)
| | - Dina F. Ahram
- Division of Nephrology, Columbia University, New York, USA
| | - Yangfan P. Liu
- Center for Human Disease Modeling, Duke University, Durham, North Carolina, USA
| | - Rik Westland
- Division of Nephrology, Columbia University, New York, USA.,Department of Pediatric Nephrology, Amsterdam UMC- Emma Children’s Hospital, Amsterdam, NL
| | | | - Nicholas Katsanis
- Center for Human Disease Modeling, Duke University, Durham, North Carolina, USA; Stanley Manne Children's Research Institute, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois, USA (current address); Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA; Department of Cell and Developmental Biology, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA.
| | - Erica E. Davis
- Center for Human Disease Modeling, Duke University, Durham, North Carolina, USA.,Stanley Manne Children’s Research Institute, Ann & Robert H. Lurie Children’s Hospital of Chicago, Chicago, Illinois, USA (current address).,Department of Pediatrics and Department of Cell and Molecular Biology, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA.,To whom correspondence should be addressed: ADDRESS CORRESPONDENCE TO: Simone Sanna-Cherchi, MD, Division of Nephrology, Columbia University, College of Physicians and Surgeons, New York, NY 10032, USA; Phone: 212-851-4925; Fax: 212-851-5461; . Erica E. Davis, PhD, Stanley Manne Children’s Research Institute, Ann & Robert H. Lurie Children’s Hospital of Chicago, Chicago, IL 60611, USA; Phone: 312-503-7662; Fax: 312-503-7343; , Nicholas Katsanis, PhD, Stanley Manne Children’s Research Institute, Ann & Robert H. Lurie Children’s Hospital of Chicago, Chicago, IL 60611, USA; Phone: 312-503-7339; Fax: 312-503-7343;
| | - Simone Sanna-Cherchi
- Department of Medicine, Division of Nephrology, Columbia University Irving Medical Center, New York, New York, USA.
| |
Collapse
|
4
|
Genome assembly and annotation. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00013-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
|
5
|
Luo X, Kang X, Schönhuth A. phasebook: haplotype-aware de novo assembly of diploid genomes from long reads. Genome Biol 2021; 22:299. [PMID: 34706745 PMCID: PMC8549298 DOI: 10.1186/s13059-021-02512-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2021] [Accepted: 10/05/2021] [Indexed: 01/27/2023] Open
Abstract
Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly. However, current long-read assemblers are either reference based, so introduce biases, or fail to capture the haplotype diversity of diploid genomes. We present phasebook, a de novo approach for reconstructing the haplotypes of diploid genomes from long reads. phasebook outperforms other approaches in terms of haplotype coverage by large margins, in addition to achieving competitive performance in terms of assembly errors and assembly contiguity.
Collapse
Affiliation(s)
- Xiao Luo
- Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Xiongbin Kang
- Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Alexander Schönhuth
- Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands.
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
6
|
Prezza N, Pisanti N, Sciortino M, Rosone G. Variable-order reference-free variant discovery with the Burrows-Wheeler Transform. BMC Bioinformatics 2020; 21:260. [PMID: 32938358 PMCID: PMC7493873 DOI: 10.1186/s12859-020-03586-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2020] [Accepted: 06/08/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves sensitivity and precision of previous de Bruijn graphs based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs and it was too slow and memory consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays), besides the BWT. RESULTS In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extend the framework of [Prezza et al., AMB 2019] to detect also INDELs, and (ii) implements recent algorithmic findings that allow to perform the whole analysis using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-cores machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel . CONCLUSIONS Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve the 71% of SNPs and 51% of INDELs found by the state-of-the art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments. Also in these cases, our tool exhibits a much higher sensitivity than the state-of-the art tool.
Collapse
Affiliation(s)
- Nicola Prezza
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
| | - Nadia Pisanti
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
| | - Marinella Sciortino
- Dipartimento di Matematica e Informatica, Università di Palermo, Via Archirafi, 34, Palermo, Italy
| | - Giovanna Rosone
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy.
| |
Collapse
|
7
|
Sankararaman A, Vikalo H, Baccelli F. ComHapDet: a spatial community detection algorithm for haplotype assembly. BMC Genomics 2020; 21:586. [PMID: 32900369 PMCID: PMC7488034 DOI: 10.1186/s12864-020-06935-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. RESULTS We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants. CONCLUSIONS Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
Collapse
Affiliation(s)
- Abishek Sankararaman
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.
| | - Haris Vikalo
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
| | - François Baccelli
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.,Department of Mathematics, The University of Texas at Austin, Austin, TX, USA
| |
Collapse
|
8
|
Baaijens JA, Schönhuth A. Overlap graph-based generation of haplotigs for diploids and polyploids. Bioinformatics 2020; 35:4281-4289. [PMID: 30994902 DOI: 10.1093/bioinformatics/btz255] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2018] [Revised: 03/18/2019] [Accepted: 04/11/2019] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity. RESULTS We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. AVAILABILITY AND IMPLEMENTATION POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Alexander Schönhuth
- Centrum Wiskunde & Informatica, XG Amsterdam, The Netherlands.,Theoretical Biology and Bioinformatics, Utrecht University, CH Utrecht, The Netherlands
| |
Collapse
|
9
|
Yan Z, Zhu X, Wang Y, Nie Y, Guan S, Kuo Y, Chang D, Li R, Qiao J, Yan L. scHaplotyper: haplotype construction and visualization for genetic diagnosis using single cell DNA sequencing data. BMC Bioinformatics 2020; 21:41. [PMID: 32007105 PMCID: PMC6995221 DOI: 10.1186/s12859-020-3381-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2019] [Accepted: 01/22/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Haplotyping reveals chromosome blocks inherited from parents to in vitro fertilized (IVF) embryos in preimplantation genetic diagnosis (PGD), enabling the observation of the transmission of disease alleles between generations. However, the methods of haplotyping that are suitable for single cells are limited because a whole genome amplification (WGA) process is performed before sequencing or genotyping in PGD, and true haplotype profiles of embryos need to be constructed based on genotypes that can contain many WGA artifacts. RESULTS Here, we offer scHaplotyper as a genetic diagnosis tool that reconstructs and visualizes the haplotype profiles of single cells based on the Hidden Markov Model (HMM). scHaplotyper can trace the origin of each haplotype block in the embryo, enabling the detection of carrier status of disease alleles in each embryo. We applied this method in PGD in two families affected with genetic disorders, and the result was the healthy live births of two children in the two families, demonstrating the clinical application of this method. CONCLUSION Next generation sequencing (NGS) of preimplantation embryos enable genetic screening for families with genetic disorders, avoiding the birth of affected babies. With the validation and successful clinical application, we showed that scHaplotyper is a convenient and accurate method to screen out embryos. More patients with genetic disorder will benefit from the genetic diagnosis of embryos. The source code of scHaplotyper is available at GitHub repository: https://github.com/yzqheart/scHaplotyper.
Collapse
Affiliation(s)
- Zhiqiang Yan
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China.,Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China.,Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
| | - Xiaohui Zhu
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Yuqian Wang
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Yanli Nie
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Shuo Guan
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Ying Kuo
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Di Chang
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Rong Li
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
| | - Jie Qiao
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China.,Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China.,Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China.,Beijing Advanced Innovation Center for Genomics (ICG), Peking University, Beijing, 100871, China
| | - Liying Yan
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China. .,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China. .,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China.
| |
Collapse
|
10
|
Mazrouee S, Wang W. PolyCluster: Minimum Fragment Disagreement Clustering for Polyploid Phasing. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:264-277. [PMID: 30040655 DOI: 10.1109/tcbb.2018.2858803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Phasing is an emerging area in computational biology with important applications in clinical decision making and biomedical sciences. While machine learning techniques have shown tremendous potential in many biomedical applications, their utility in phasing has not yet been fully understood. In this paper, we investigate development of clustering-based techniques for phasing in polyploidy organisms where more than two copies of each chromosome exist in the cells of the organism under study. We develop a novel framework, called PolyCluster, based on the concept of correlation clustering followed by an effective cluster merging mechanism to minimize the amount of disagreement among short reads residing in each cluster. We first introduce a graph model to quantify the amount of similarity between each pair of DNA reads. We then present a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. Our extensive analysis demonstrates the effectiveness of PolyCluster in accurate and scalable phasing. In particular, we show that PolyCluster reduces switching error of H-PoP, HapColor, and HapTree by 44.4, 51.2, and 48.3 percent, respectively. Also, the running time of PolyCluster is several orders-of-magnitude less than HapTree while it achieves a running time comparable to other algorithms.
Collapse
|
11
|
Abstract
Bioinformatics regularly poses new challenges to algorithm engineers and theoretical computer scientists. This work surveys recent developments of parameterized algorithms and complexity for important NP-hard problems in bioinformatics. We cover sequence assembly and analysis, genome comparison and completion, and haplotyping and phylogenetics. Aside from reporting the state of the art, we give challenges and open problems for each topic.
Collapse
|
12
|
Ebler J, Haukness M, Pesout T, Marschall T, Paten B. Haplotype-aware diplotyping from noisy long reads. Genome Biol 2019; 20:116. [PMID: 31159868 PMCID: PMC6547545 DOI: 10.1186/s13059-019-1709-0] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2018] [Accepted: 05/06/2019] [Indexed: 12/19/2022] Open
Abstract
Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.
Collapse
Affiliation(s)
- Jana Ebler
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA.
| |
Collapse
|
13
|
Tangherloni A, Spolaor S, Rundo L, Nobile MS, Cazzaniga P, Mauri G, Liò P, Merelli I, Besozzi D. GenHap: a novel computational method based on genetic algorithms for haplotype assembly. BMC Bioinformatics 2019; 20:172. [PMID: 30999845 PMCID: PMC6471693 DOI: 10.1186/s12859-019-2691-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Background In order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications. Results To reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets. Conclusions Future-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.
Collapse
Affiliation(s)
- Andrea Tangherloni
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, U14 Building, Milan, 20126, Italy.
| | - Simone Spolaor
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, U14 Building, Milan, 20126, Italy
| | - Leonardo Rundo
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, U14 Building, Milan, 20126, Italy.,Institute of Molecular Bioimaging and Physiology, Italian National Research Council, Contrada Pietrapollastra-Pisciotto, Cefalù (PA), 90015, Italy
| | - Marco S Nobile
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, U14 Building, Milan, 20126, Italy.,SYSBIO.IT Centre of Systems Biology, Piazza della Scienza 2, Milan, 20126, Italy
| | - Paolo Cazzaniga
- Department of Human and Social Sciences, University of Bergamo, Piazzale Sant'Agostino 2, Bergamo, 24129, Italy.,SYSBIO.IT Centre of Systems Biology, Piazza della Scienza 2, Milan, 20126, Italy
| | - Giancarlo Mauri
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, U14 Building, Milan, 20126, Italy.,SYSBIO.IT Centre of Systems Biology, Piazza della Scienza 2, Milan, 20126, Italy
| | - Pietro Liò
- Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK
| | - Ivan Merelli
- Institute of Biomedical Technologies, Italian National Research Council, Via Fratelli Cervi 93, Segrate (MI), 20090, Italy
| | - Daniela Besozzi
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, U14 Building, Milan, 20126, Italy
| |
Collapse
|
14
|
Wee Y, Bhyan SB, Liu Y, Lu J, Li X, Zhao M. The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing. Brief Funct Genomics 2018; 18:1-12. [DOI: 10.1093/bfgp/ely037] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2018] [Revised: 10/03/2018] [Accepted: 10/19/2018] [Indexed: 02/06/2023] Open
Affiliation(s)
- YongKiat Wee
- School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland, Australia
| | - Salma Begum Bhyan
- School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland, Australia
| | - Yining Liu
- The School of Public Health, Institute for Chemical Carcinogenesis,Guangzhou Medical University, Dongfengxi Road, Guangzhou, China
| | - Jiachun Lu
- The School of Public Health, Institute for Chemical Carcinogenesis,Guangzhou Medical University, Dongfengxi Road, Guangzhou, China
- The School of Public Health, The First Affiliated Hospital, Guangzhou Medical University, Guangzhou, China
| | - Xiaoyan Li
- Beijing Anzhen Hospital, Capital Medical University, Beijing Institute of Heart, Lung & Blood Vessel Disease, Beijing, China
| | - Min Zhao
- School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland, Australia
| |
Collapse
|
15
|
Lin YY, Wu PC, Chen PL, Oyang YJ, Chen CY. HAHap: a read-based haplotyping method using hierarchical assembly. PeerJ 2018; 6:e5852. [PMID: 30397550 PMCID: PMC6214236 DOI: 10.7717/peerj.5852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2018] [Accepted: 09/27/2018] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary trend to resolve haplotypes by reducing conflicts in a single nucleotide polymorphism-fragment matrix. However, it is frequently observed that the solution with the optimal MEC might not be the real haplotypes, due to the fact that MEC methods consider all positions together and sometimes the conflicts in noisy regions might mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. RESULTS This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confident variant pairs to build haplotypes progressively. The phasing results by HAHap on both real and simulated data, compared to other MEC-based methods, revealed better phasing error rates for constructing haplotypes using short reads from whole-genome sequencing. We compared the number of error corrections (ECs) on real data with other methods, and it reveals the ability of HAHap to predict haplotypes with a lower number of ECs. We also used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting the applicability of HAHap in certain situations.
Collapse
Affiliation(s)
- Yu-Yu Lin
- Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
| | - Ping Chun Wu
- Taipei Blood Center, Taiwan Blood Services Foundation, Taipei, Taiwan
| | - Pei-Lung Chen
- Graduate Institute of Medical Genomics and Proteomics, College of Medicine, National Taiwan University, Taipei, Taiwan
| | - Yen-Jen Oyang
- Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
| | - Chien-Yu Chen
- Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan
| |
Collapse
|
16
|
Beretta S, Patterson MD, Zaccaria S, Della Vedova G, Bonizzoni P. HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads. BMC Bioinformatics 2018; 19:252. [PMID: 29970002 PMCID: PMC6029272 DOI: 10.1186/s12859-018-2253-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2017] [Accepted: 06/18/2018] [Indexed: 01/08/2023] Open
Abstract
Background Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Results Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, allowing to consider datasets composed of higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60 × coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Conclusions Our method leverages the long-range information of sequencing reads that allows to obtain assembled haplotypes fragmented in a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. Availability HapCHAT is available at http://hapchat.algolab.euunder the GNU Public License (GPL).
Collapse
Affiliation(s)
- Stefano Beretta
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| | - Murray D Patterson
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy.
| | - Simone Zaccaria
- Department of Computer Science, Princeton University, Princeton, New Jersey, USA
| | - Gianluca Della Vedova
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
| |
Collapse
|
17
|
Abstract
Motivation Current technologies for single-cell DNA sequencing require whole-genome amplification (WGA), as a single cell contains too little DNA for direct sequencing. Unfortunately, WGA introduces biases in the resulting sequencing data, including non-uniformity in genome coverage and high rates of allele dropout. These biases complicate many downstream analyses, including the detection of genomic variants. Results We show that amplification biases have a potential upside: long-range correlations in rates of allele dropout provide a signal for phasing haplotypes at the lengths of amplicons from WGA, lengths which are generally longer than than individual sequence reads. We describe a statistical test to measure concurrent allele dropout between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. We use results of this test to perform haplotype assembly across a collection of single cells. We demonstrate that the algorithm predicts phasing between pairs of SNPs with higher accuracy than phasing from reads alone. Using whole-genome sequencing data from only seven neural cells, we obtain haplotype blocks that are orders of magnitude longer than with sequence reads alone (median length 10.2 kb versus 312 bp), with error rates <2%. We demonstrate similar advantages on whole-exome data from 16 cells, where we obtain haplotype blocks with median length 9.2 kb-comparable to typical gene lengths-compared with median lengths of 41 bp with sequence reads alone, with error rates <4%. Our algorithm will be useful for haplotyping of rare alleles and studies of allele-specific somatic aberrations. Availability and implementation Source code is available at https://www.github.com/raphael-group. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gryte Satas
- Department of Computer Science, Princeton University, Princeton, NJ, USA
- Department of Computer Science, Brown University, Providence, RI, USA
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ, USA
| |
Collapse
|
18
|
Abstract
BACKGROUND Haplotype assembly is the task of reconstructing haplotypes of an individual from a mixture of sequenced chromosome fragments. Haplotype information enables studies of the effects of genetic variations on an organism's phenotype. Most of the mathematical formulations of haplotype assembly are known to be NP-hard and haplotype assembly becomes even more challenging as the sequencing technology advances and the length of the paired-end reads and inserts increases. Assembly of haplotypes polyploid organisms is considerably more difficult than in the case of diploids. Hence, scalable and accurate schemes with provable performance are desired for haplotype assembly of both diploid and polyploid organisms. RESULTS We propose a framework that formulates haplotype assembly from sequencing data as a sparse tensor decomposition. We cast the problem as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, U and [Formula: see text]; tensor [Formula: see text] reveals haplotype information while U is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by iteratively solving this decomposition problem is proposed. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. The developed framework is applicable to diploid, biallelic and polyallelic polyploid species. The code for AltHap is freely available from https://github.com/realabolfazl/AltHap . CONCLUSION AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and significantly outperforms existing techniques when applied to haplotype assembly of polyploids.
Collapse
Affiliation(s)
- Abolfazl Hashemi
- Department of ECE, University of Texas at Austin, Austin, Texas, USA
| | - Banghua Zhu
- EE Department, Tsinghua University, Beijing, China
| | - Haris Vikalo
- Department of ECE, University of Texas at Austin, Austin, Texas, USA
| |
Collapse
|
19
|
Guo F, Wang D, Wang L. Progressive approach for SNP calling and haplotype assembly using single molecular sequencing data. Bioinformatics 2018; 34:2012-2018. [DOI: 10.1093/bioinformatics/bty059] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2017] [Accepted: 02/17/2018] [Indexed: 12/30/2022] Open
Affiliation(s)
- Fei Guo
- School of Computer Science and Technology, Tianjin University, Tianjin Haihe Education Park, Tianjin, China
| | - Dan Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
- University of Hong Kong Shenzhen Research Institute, Shenzhen Hi-Tech Industrial Park, Shenzhen, Guangdong, China
| |
Collapse
|
20
|
Abstract
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
Collapse
|
21
|
Abstract
MOTIVATION Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information-reads and pedigree-has the potential to deliver results better than each individually. RESULTS We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual. AVAILABILITY AND IMPLEMENTATION https://bitbucket.org/whatshap/whatshap CONTACT t.marschall@mpi-inf.mpg.de.
Collapse
Affiliation(s)
- Shilpa Garg
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany Max Planck Institute for Informatics, Saarbrücken, Germany Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
| | - Marcel Martin
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, SE-17121 Solna, Sweden
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany Max Planck Institute for Informatics, Saarbrücken, Germany
| |
Collapse
|
22
|
Ben-Elazar S, Chor B, Yakhini Z. Extending partial haplotypes to full genome haplotypes using chromosome conformation capture data. Bioinformatics 2017; 32:i559-i566. [PMID: 27587675 DOI: 10.1093/bioinformatics/btw453] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Complex interactions among alleles often drive differences in inherited properties including disease predisposition. Isolating the effects of these interactions requires phasing information that is difficult to measure or infer. Furthermore, prevalent sequencing technologies used in the essential first step of determining a haplotype limit the range of that step to the span of reads, namely hundreds of bases. With the advent of pseudo-long read technologies, observable partial haplotypes can span several orders of magnitude more. Yet, measuring whole-genome-single-individual haplotypes remains a challenge. A different view of whole genome measurement addresses the 3D structure of the genome-with great development of Hi-C techniques in recent years. A shortcoming of current Hi-C, however, is the difficulty in inferring information that is specific to each of a pair of homologous chromosomes. RESULTS In this work, we develop a robust algorithmic framework that takes two measurement derived datasets: raw Hi-C and partial short-range haplotypes, and constructs the full-genome haplotype as well as phased diploid Hi-C maps. By analyzing both data sets together we thus bridge important gaps in both technologies-from short to long haplotypes and from un-phased to phased Hi-C. We demonstrate that our method can recover ground truth haplotypes with high accuracy, using measured biological data as well as simulated data. We analyze the impact of noise, Hi-C sequencing depth and measured haplotype lengths on performance. Finally, we use the inferred 3D structure of a human genome to point at transcription factor targets nuclear co-localization. AVAILABILITY AND IMPLEMENTATION The implementation available at https://github.com/YakhiniGroup/SpectraPh CONTACT zohar.yakhini@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shay Ben-Elazar
- Department of Computer Science, Tel-Aviv University, Israel Microsoft R&D, HerzlyiaIsrael
| | - Benny Chor
- Department of Computer Science, Tel-Aviv University, Israel
| | - Zohar Yakhini
- Agilent Laboratories, Tel-Aviv, Israel Computer Science Department, Technion - Israel Institute of Technology, Haifa, Israel School of computer science, Herzeliya Interdisciplinary Center
| |
Collapse
|
23
|
Mardis ER. DNA sequencing technologies: 2006–2016. Nat Protoc 2017; 12:213-218. [DOI: 10.1038/nprot.2016.182] [Citation(s) in RCA: 211] [Impact Index Per Article: 30.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2016] [Accepted: 10/20/2016] [Indexed: 12/11/2022]
|
24
|
Bracciali A, Aldinucci M, Patterson M, Marschall T, Pisanti N, Merelli I, Torquati M. PWHATSHAP: efficient haplotyping for future generation sequencing. BMC Bioinformatics 2016; 17:342. [PMID: 28185544 PMCID: PMC5046197 DOI: 10.1186/s12859-016-1170-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open
Abstract
Background Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WhatsHap is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology’s current trends that are producing longer fragments. Results Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WhatsHap, which increases with coverage. Conclusions Due to its structure and management of the large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
Collapse
Affiliation(s)
- Andrea Bracciali
- Computer Science and Mathematics, School of Natural Sciences, Stirling University, Stirling, FK9 4LA, UK.
| | - Marco Aldinucci
- Department of Computer Science, University of Torino, Torino, Italy
| | - Murray Patterson
- Laboratoire de Biométrie et Biologie Evolutive, University Claude Bernard, Lyon, France
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland, Germany.,Max Planck Institute for Informatics, Saarbrücken, Germany
| | - Nadia Pisanti
- Department of Computer Science, University of Pisa, Pisa, Italy.,Erable Team, INRIA, Grenoble, France
| | - Ivan Merelli
- Institute of Biomedical Technologies, National Research Council, Milan, Italy
| | | |
Collapse
|
25
|
Xie M, Wu Q, Wang J, Jiang T. H-PoP and H-PoPG: heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics 2016; 32:3735-3744. [PMID: 27531103 DOI: 10.1093/bioinformatics/btw537] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2016] [Revised: 06/23/2016] [Accepted: 08/09/2016] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Some economically important plants including wheat and cotton have more than two copies of each chromosome. With the decreasing cost and increasing read length of next-generation sequencing technologies, reconstructing the multiple haplotypes of a polyploid genome from its sequence reads becomes practical. However, the computational challenge in polyploid haplotyping is much greater than that in diploid haplotyping, and there are few related methods. RESULTS This article models the polyploid haplotyping problem as an optimal poly-partition problem of the reads, called the Polyploid Balanced Optimal Partition model. For the reads sequenced from a k-ploid genome, the model tries to divide the reads into k groups such that the difference between the reads of the same group is minimized while the difference between the reads of different groups is maximized. When the genotype information is available, the model is extended to the Polyploid Balanced Optimal Partition with Genotype constraint problem. These models are all NP-hard. We propose two heuristic algorithms, H-PoP and H-PoPG, based on dynamic programming and a strategy of limiting the number of intermediate solutions at each iteration, to solve the two models, respectively. Extensive experimental results on simulated and real data show that our algorithms can solve the models effectively, and are much faster and more accurate than the recent state-of-the-art polyploid haplotyping algorithms. The experiments also show that our algorithms can deal with long reads and deep read coverage effectively and accurately. Furthermore, H-PoP might be applied to help determine the ploidy of an organism. AVAILABILITY AND IMPLEMENTATION https://github.com/MinzhuXie/H-PoPG CONTACT: xieminzhu@hotmail.comSupplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Minzhu Xie
- Key Laboratory of Internet of Things Technologies and Application, College of Physics and Information Science, Hunan Normal University, Changsha 410081, China
| | - Qiong Wu
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
| | - Jianxin Wang
- School of Information Science and Engineering, Central South University, Changsha 410083, China
| | - Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.,MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
| |
Collapse
|
26
|
Bonizzoni P, Dondi R, Klau GW, Pirola Y, Pisanti N, Zaccaria S. On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes. J Comput Biol 2016; 23:718-36. [PMID: 27280382 DOI: 10.1089/cmb.2015.0220] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.
Collapse
Affiliation(s)
- Paola Bonizzoni
- 1 Department of Computer Science (DISCO), University of Milano-Bicocca , Milan, Italy
| | - Riccardo Dondi
- 2 Department of Social and Human Sciences, University of Bergamo , Bergamo, Italy
| | - Gunnar W Klau
- 3 Life Sciences Group, Centrum Wiskunde & Informatica (CWI) , Amsterdam, The Netherlands .,4 ERABLE Team , INRIA, Lyon, France
| | - Yuri Pirola
- 1 Department of Computer Science (DISCO), University of Milano-Bicocca , Milan, Italy
| | - Nadia Pisanti
- 4 ERABLE Team , INRIA, Lyon, France .,5 Department of Computer Science, University of Pisa , Pisa, Italy
| | - Simone Zaccaria
- 1 Department of Computer Science (DISCO), University of Milano-Bicocca , Milan, Italy
| |
Collapse
|