1
Bai X, Chen Z, Chen K, Wu Z, Wang R, Liu J, Chang L, Wen L, Tang F. Simultaneous de novo calling and phasing of genetic variants at chromosome-scale using NanoStrand-seq. Cell Discov 2024; 10:74. [PMID: 38977679 PMCID: PMC11231365 DOI: 10.1038/s41421-024-00694-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 05/23/2024] [Indexed: 07/10/2024]
Abstract
The successful accomplishment of the first telomere-to-telomere human genome assembly, T2T-CHM13, marked a milestone in achieving completeness of the human reference genome. The upcoming era of genome study will focus on fully phased diploid genome assembly, with an emphasis on genetic differences between individual haplotypes. Most existing sequencing approaches only achieved localized haplotype phasing and relied on additional pedigree information for further whole-chromosome scale phasing. The short-read-based Strand-seq method is able to directly phase single nucleotide polymorphisms (SNPs) at whole-chromosome scale but falls short when it comes to phasing structural variations (SVs). To address this limitation, we developed a Nanopore sequencing platform-based Strand-seq approach, which we named NanoStrand-seq. This method allowed for de novo SNP calling with high precision (99.52%) and achieved a superior phasing accuracy (0.02% Hamming error rate) at whole-chromosome scale, a level of performance comparable to Strand-seq for haplotype phasing of the GM12878 genome. Importantly, we demonstrated that NanoStrand-seq can efficiently resolve the MHC locus, a highly polymorphic genomic region. Moreover, NanoStrand-seq enabled independent direct calling and phasing of deletions and insertions at whole-chromosome level; when applied to long genomic regions of SNP homozygosity, it outperformed the strategy that combined Strand-seq with bulk long-read sequencing. Finally, we showed that, like Strand-seq, NanoStrand-seq was also applicable to primary cultured cells. Together, here we provided a novel methodology that enabled interrogation of a full spectrum of haplotype-resolved SNPs and SVs at whole-chromosome scale, with broad applications for species with diploid or even potentially polyploid genomes.
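The abstract reports phasing accuracy as a Hamming error rate without defining it. The sketch below uses one common definition, which may differ in detail from the one used in the paper: the fraction of heterozygous sites at which the predicted haplotype disagrees with the truth, after allowing for the fact that haplotype labels are arbitrary. The function name and the toy haplotype strings are hypothetical.

```python
def hamming_error_rate(predicted, truth):
    """Fraction of heterozygous sites phased differently from the truth.

    Each haplotype is a string over {'0', '1'} giving the allele carried by
    one haplotype at each heterozygous site. Because haplotype labels are
    arbitrary, the prediction is scored against both the truth and its
    complement, and the smaller mismatch count is used.
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if not truth:
        raise ValueError("no sites to compare")
    mismatches = sum(p != t for p, t in zip(predicted, truth))
    mismatches = min(mismatches, len(truth) - mismatches)
    return mismatches / len(truth)


# Toy data (assumed): the prediction is the complement of the truth
# except at one site, so exactly one of ten sites is wrongly phased.
truth = "0101100110"
predicted = "1010011011"
print(hamming_error_rate(predicted, truth))
```

Under this definition, a rate of 0.02% would correspond to roughly 2 wrongly phased sites per 10,000 heterozygous sites.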
Affiliation(s)
- Xiuzhen Bai
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
  - Changping Laboratory, Beijing, China
- Zonggui Chen
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - Changping Laboratory, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Kexuan Chen
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - School of Life Sciences, Peking University, Beijing, China
- Zixin Wu
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Rui Wang
  - Department of Medicine, Cancer Institute, Stanford University, Stanford, CA, USA
- Jun'e Liu
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
  - Changping Laboratory, Beijing, China
  - School of Life Sciences, Peking University, Beijing, China
- Liang Chang
  - State Key Laboratory of Female Fertility Promotion, Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, China
  - National Clinical Research Center for Obstetrics and Gynecology (Peking University Third Hospital), Beijing, China
  - Key Laboratory of Assisted Reproduction (Peking University), Ministry of Education Beijing, Beijing, China
  - Key Laboratory of Reproductive Endocrinology and Assisted Reproductive Technology, Beijing, China
- Lu Wen
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
  - Changping Laboratory, Beijing, China
- Fuchou Tang
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
  - Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
  - Changping Laboratory, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - School of Life Sciences, Peking University, Beijing, China
2
Henglin M, Ghareghani M, Harvey W, Porubsky D, Koren S, Eichler EE, Ebert P, Marschall T. Phasing Diploid Genome Assembly Graphs with Single-Cell Strand Sequencing. bioRxiv 2024:2024.02.15.580432. [PMID: 38529499 PMCID: PMC10962706 DOI: 10.1101/2024.02.15.580432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de-novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de-novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio-phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.
Affiliation(s)
- Mir Henglin
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
- Maryam Ghareghani
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Germany
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- William Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Peter Ebert
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
  - Core Unit Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
3
Irastorza-Azcarate I, Kukalev A, Kempfer R, Thieme CJ, Mastrobuoni G, Markowski J, Loof G, Sparks TM, Brookes E, Natarajan KN, Sauer S, Fisher AG, Nicodemi M, Ren B, Schwarz RF, Kempa S, Pombo A. Extensive folding variability between homologous chromosomes in mammalian cells. bioRxiv 2024:2024.05.08.591087. [PMID: 38766012 PMCID: PMC11100664 DOI: 10.1101/2024.05.08.591087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Genetic variation and 3D chromatin structure have major roles in gene regulation. Due to challenges in mapping chromatin conformation with haplotype-specific resolution, the effects of genetic sequence variation on 3D genome structure and gene expression imbalance remain understudied. Here, we applied Genome Architecture Mapping (GAM) to a hybrid mouse embryonic stem cell (mESC) line with high density of single nucleotide polymorphisms (SNPs). GAM resolved haplotype-specific 3D genome structures with high sensitivity, revealing extensive allelic differences in chromatin compartments, topologically associating domains (TADs), long-range enhancer-promoter contacts, and CTCF loops. Architectural differences often coincide with allele-specific differences in gene expression, mediated by Polycomb repression. We show that histone genes are expressed with allelic imbalance in mESCs, are involved in haplotype-specific chromatin contact marked by H3K27me3, and are targets of Polycomb repression through conditional knockouts of Ezh2 or Ring1b. Our work reveals highly distinct 3D folding structures between homologous chromosomes, and highlights their intricate connections with allelic gene expression.
4
Ashraf H, Ebler J, Marschall T. Allele detection using k-mer-based sequencing error profiles. Bioinformatics Advances 2023; 3:vbad149. [PMID: 37928341 PMCID: PMC10625474 DOI: 10.1093/bioadv/vbad149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 09/21/2023] [Accepted: 10/19/2023] [Indexed: 11/07/2023]
Abstract
Motivation: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels.
Results: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3× coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease.
Availability and implementation: https://github.com/whatshap/whatshap.
Affiliation(s)
- Hufsah Ashraf
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
5
Ouchi S, Kajitani R, Itoh T. GreenHill: a de novo chromosome-level scaffolding and phasing tool using Hi-C. Genome Biol 2023; 24:162. [PMID: 37434204 DOI: 10.1186/s13059-023-03006-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2022] [Accepted: 07/04/2023] [Indexed: 07/13/2023]
Abstract
Chromosome-level haplotype-resolved genome assembly is an important resource in molecular biology. However, current de novo haplotype assemblers require parental data or reference genomes and often fail to provide chromosome-level results. We present GreenHill, a novel scaffolding and phasing tool that considers various assemblers' contigs as input to reconstruct chromosome-level haplotypes using Hi-C without parental or reference data. Its unique functions include new error correction based on Hi-C contacts and the simultaneous use of Hi-C and long reads. Benchmarks reveal that GreenHill outperforms other approaches in contiguity and phasing accuracy, and the majority of chromosome arms are entirely phased.
Affiliation(s)
- Shun Ouchi
  - School of Life Science and Technology, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-Ku, Tokyo, 152-8550, Japan
- Rei Kajitani
  - School of Life Science and Technology, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-Ku, Tokyo, 152-8550, Japan
- Takehiko Itoh
  - School of Life Science and Technology, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-Ku, Tokyo, 152-8550, Japan
6
Fernández Álvarez J, Navas González FJ, León Jurado JM, González Ariza A, Martínez Martínez MA, Pastrana CI, Pizarro Inostroza MG, Delgado Bermejo JV. Discriminant canonical tool for inferring the effect of αS1, αS2, β, and κ casein haplotypes and haplogroups on zoometric/linear appraisal breeding values in Murciano-Granadina goats. Front Vet Sci 2023; 10:1138528. [PMID: 37483293 PMCID: PMC10360128 DOI: 10.3389/fvets.2023.1138528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 06/19/2023] [Indexed: 07/25/2023]
Abstract
Genomic tools have shown promising results in maximizing breeding outcomes, but their impact has not yet been explored. This study aimed to outline the effect of the individual haplotypes of each component of the casein complex (αS1, β, αS2, and κ-casein) on zoometric/linear appraisal breeding values. A discriminant canonical analysis was performed to study the relationship between the predicted breeding value for 17 zoometric/linear appraisal traits and the aforementioned casein gene haplotypic sequences. The analysis considered a total of 41,323 zoometric/linear appraisal records from 22,727 primiparous does, 17,111 multiparous does, and 1,485 bucks registered in the Murciano-Granadina goat breed herdbook. Results suggest that, although a lack of significant differences (p > 0.05) was reported across the predictive breeding values of zoometric/linear appraisal traits for αS1, αS2, and κ casein, significant differences were found for β casein (p < 0.05). The presence of β casein haplotypic sequences GAGACCCC, GGAACCCC, GGAACCTC, GGAATCTC, GGGACCCC, GGGATCTC, and GGGGCCCC, linked to differential combinations of increased quantities of higher quality milk in terms of its composition, may also be connected to increased zoometric/linear appraisal predicted breeding values. Selection must be performed carefully, given the fact that the consideration of apparently desirable animals that present the haplotypic sequence GGGATCCC in the β casein gene, due to their positive predicted breeding values for certain zoometric/linear appraisal traits such as rear insertion height, bone quality, anterior insertion, udder depth, rear legs side view, and rear legs rear view, may lead to an indirect selection against the other zoometric/linear appraisal traits and in turn lead to an inefficient selection toward an optimal dairy morphological type in Murciano-Granadina goats.
Contrastingly, the consideration of animals presenting the GGAACCCC haplotypic sequence involves also considering animals that increase the genetic potential for all zoometric/linear appraisal traits, thus making them recommendable as breeding animals. The relevance of this study relies on the fact that the information derived from these analyses will enhance the selection of breeding individuals, in which a desirable dairy type is indirectly sought, through the haplotypic sequences in the β casein locus, which is not currently routinely considered in the Murciano-Granadina goat breeding program.
Affiliation(s)
- José M. León Jurado
  - Agropecuary Provincial Centre, Diputación Provincial de Córdoba, Córdoba, Spain
- Antonio González Ariza
  - Department of Genetics, University of Córdoba, Córdoba, Spain
  - Agropecuary Provincial Centre, Diputación Provincial de Córdoba, Córdoba, Spain
- María G. Pizarro Inostroza
  - Department of Genetics, University of Córdoba, Córdoba, Spain
  - Animal Breeding Consulting, S.L., Córdoba Science and Technology Park Rabanales, Córdoba, Spain
7
Zhou Y, Leung AWS, Ahmed SS, Lam TW, Luo R. Duet: SNP-assisted structural variant calling and phasing using Oxford nanopore sequencing. BMC Bioinformatics 2022; 23:465. [PMCID: PMC9639287 DOI: 10.1186/s12859-022-05025-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Accepted: 10/29/2022] [Indexed: 11/09/2022]
Abstract
Background: Whole genome sequencing using the long-read Oxford Nanopore Technologies (ONT) MinION sequencer provides a cost-effective option for structural variant (SV) detection in clinical applications. Despite the advantage of using long reads, however, accurate SV calling and phasing are still challenging.
Results: We introduce Duet, an SV detection tool optimized for SV calling and phasing using ONT data. The tool uses novel features integrated from both SV signatures and single-nucleotide polymorphism signatures, which can accurately distinguish SV haplotype from a false signal. Duet was benchmarked against state-of-the-art tools on multiple ONT sequencing datasets of sequencing coverage ranging from 8× to 40×. At low sequencing coverage of 8×, Duet performs better than all other tools in SV calling, SV genotyping and SV phasing. When the sequencing coverage is higher (20× to 40×), the F1-score for SV phasing is further improved in comparison to the performance of other tools, while its performance of SV genotyping and SV calling remains higher than other tools.
Conclusion: Duet can perform accurate SV calling, SV genotyping and SV phasing using low-coverage ONT data, making it very useful for low-coverage genomes. It has great performance when scaled to high-coverage genomes, which is adaptable to various clinical applications. Duet is open source and is available at https://github.com/yekaizhou/duet.
Affiliation(s)
- Yekai Zhou
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Amy Wing-Sze Leung
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Syed Shakeel Ahmed
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Ruibang Luo
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
8
The Contribution of JAK2 46/1 Haplotype in the Predisposition to Myeloproliferative Neoplasms. Int J Mol Sci 2022; 23:ijms232012582. [PMID: 36293440 PMCID: PMC9604447 DOI: 10.3390/ijms232012582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Revised: 10/13/2022] [Accepted: 10/15/2022] [Indexed: 11/17/2022]
Abstract
Haplotype 46/1 (GGCC) consists of a set of genetic variations distributed along chromosome 9p24.1, which extend from the Janus Kinase 2 gene to Insulin like 4. Marked by four jointly inherited variants (rs3780367, rs10974944, rs12343867, and rs1159782), this haplotype has a strong association with the development of BCR-ABL1-negative myeloproliferative neoplasms (MPNs) because it precedes the acquisition of the JAK2V617F variant, a common genetic alteration in individuals with these hematological malignancies. It is also described as one of the factors that increases the risk of familial MPNs by more than five times. In addition, 46/1 is associated with events related to inflammatory dysregulation, splenomegaly, splanchnic vein thrombosis, Budd–Chiari syndrome, and increases in RBC count, platelets, leukocytes, hematocrit, and hemoglobin, which are characteristic of MPNs, as well as other findings that are still being elucidated and which are of great interest for the etiopathological understanding of these hematological neoplasms. Considering these factors, the present review aims to describe the main findings and discussions involving the 46/1 haplotype, and highlights the molecular and immunological aspects and their relevance as a tool for clinical practice and investigation of familial cases.
9
Zhang T, Zhou J, Gao W, Jia Y, Wei Y, Wang G. Complex genome assembly based on long-read sequencing. Brief Bioinform 2022; 23:6657663. [PMID: 35940845 DOI: 10.1093/bib/bbac305] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/06/2022] [Revised: 06/20/2022] [Accepted: 07/06/2022] [Indexed: 11/12/2022]
Abstract
High-quality chromosome-scale genome sequences provide an important basis for downstream genomic analysis, especially the construction of haplotype-resolved and complete genomes, which plays a key role in genome annotation, mutation detection, evolutionary analysis, gene function research, comparative genomics and other aspects. However, genome-wide short-read sequencing struggles to produce a complete genome when the genome is complex, with high duplication and multiple heterozygosity. The emergence of long-read sequencing technology has greatly improved the integrity of complex genome assembly. We review a variety of computational methods for complex genome assembly and describe in detail the theories, innovations and shortcomings of collapsed, semi-collapsed and uncollapsed assemblers based on long reads. Among the three methods, uncollapsed assembly is the most correct and complete way to represent genomes. In addition, genome assembly is closely related to haplotype reconstruction: uncollapsed assembly realizes haplotype reconstruction, and haplotype reconstruction promotes uncollapsed assembly. We hope that gapless, telomere-to-telomere and accurate assembly of complex genomes can be truly routinely achieved using only a simple process or a single tool in the future.
Affiliation(s)
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Jie Zhou
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Wentao Gao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yanan Wei
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
10
Robinson M, Joshi A, Vidyarthi A, Maccoun M, Rangavajjhala S, Glusman G. Quality control of large genome datasets. HGG Advances 2022; 3:100123. [PMID: 35789587 PMCID: PMC9250042 DOI: 10.1016/j.xhgg.2022.100123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 06/02/2022] [Indexed: 11/26/2022]
Abstract
The 1000 Genomes Project (TGP) is a foundational resource that serves the biomedical community as a standard reference cohort for human genetic variation. There are now seven public versions of these genomes. The TGP Consortium produced the first by mapping its final data release against human reference sequence GRCh37, then “lifted over” these genomes to the improved reference sequence (GRCh38) when it was released, and remapped the original data to GRCh38 with two similar pipelines. As best-practice quality validation, the pipelines that generated these versions were benchmarked against the Genome In A Bottle Consortium’s “platinum quality” genome (NA12878). The New York Genome Center recently released the results of independently resequencing the cohort at greater depth (30×), a phased version informed by the inclusion of related individuals, and independently remapped the original variant calls to GRCh38. We performed a cross-comparison evaluation of all seven versions using genome fingerprinting, which supports ultrafast genome comparison even across reference versions. We noted multiple issues, including discrepancies in cohort membership, disagreement on the overall level of variation, evidence of substandard pipeline performance on specific genomes and in specific regions of the genome, cryptic relationships between individuals, inconsistent phasing, and annotation distortions caused by the history of the reference genome itself. We therefore recommend global quality assessment by rapid genome comparisons, alongside benchmarking as part of best-practice quality assessment of large genome datasets. Our observations also help inform the decision of which version to use, to support analyses by individual researchers.
11
Wakita S, Hara M, Kitabatake Y, Kawatani K, Kurahashi H, Hashizume R. Experimental method for haplotype phasing across the entire length of chromosome 21 in trisomy 21 cells using a chromosome elimination technique. J Hum Genet 2022; 67:565-572. [PMID: 35637312 PMCID: PMC9510051 DOI: 10.1038/s10038-022-01049-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2022] [Revised: 04/25/2022] [Accepted: 05/12/2022] [Indexed: 11/09/2022]
Abstract
Modern sequencing technologies produce a single consensus sequence without distinguishing between homologous chromosomes. Haplotype phasing solves this limitation by identifying alleles on the maternal and paternal chromosomes. This information is critical for understanding gene expression models in genetic disease research. Furthermore, the haplotype phasing of three homologous chromosomes in trisomy cells is more complicated than that in disomy cells. In this study, we attempted the accurate and complete haplotype phasing of chromosome 21 in trisomy 21 cells. To separate homologs, we established three corrected disomy cell lines (ΔPaternal chromosome, ΔMaternal chromosome 1, and ΔMaternal chromosome 2) from trisomy 21 induced pluripotent stem cells by eliminating one chromosome 21 utilizing the Cre-loxP system. These cells were then whole-genome sequenced by a next-generation sequencer. By simply comparing the base information of the whole-genome sequence data at the same position between each corrected disomy cell line, we determined the base on the eliminated chromosome and performed phasing. We phased 51,596 single nucleotide polymorphisms (SNPs) on chromosome 21, randomly selected seven SNPs spanning the entire length of the chromosome, and confirmed that there was no contradiction by direct sequencing.
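The comparison step described in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: it assumes an idealized, error-free two-allele genotype call at one position in each corrected disomy line, and the function name and line labels are hypothetical. Because every chromosome 21 homolog is present in exactly two of the three lines, summing the three genotypes counts each homolog's base twice; halving gives the trisomic genotype, and subtracting a line's genotype leaves the base on the chromosome that line has lost.

```python
from collections import Counter


def phase_site(del_paternal, del_maternal1, del_maternal2):
    """Infer the base carried by each eliminated chromosome at one position.

    Each argument is the two-base genotype (e.g. "AG") observed in the
    disomy line lacking the named chromosome. Returns a dict mapping
    chromosome label to its base. Raises ValueError on inconsistent input.
    """
    lines = {
        "paternal": Counter(del_paternal),
        "maternal1": Counter(del_maternal1),
        "maternal2": Counter(del_maternal2),
    }
    if any(sum(genotype.values()) != 2 for genotype in lines.values()):
        raise ValueError("each line needs exactly two alleles")

    # Each homolog appears in two lines, so every base count must be even.
    total = Counter()
    for genotype in lines.values():
        total += genotype
    if any(count % 2 for count in total.values()):
        raise ValueError("genotypes are not mutually consistent")
    trisomy = Counter({base: count // 2 for base, count in total.items()})

    phased = {}
    for chromosome, genotype in lines.items():
        missing = trisomy - genotype  # base lost with the eliminated homolog
        if sum(missing.values()) != 1:
            raise ValueError("genotypes are not mutually consistent")
        phased[chromosome] = next(iter(missing))
    return phased


# Toy example (assumed data): paternal = A, maternal 1 = G, maternal 2 = A.
print(phase_site("GA", "AA", "AG"))
```

Real data would additionally need handling for sequencing errors, missing calls, and allelic dropout, none of which this sketch attempts.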
Affiliation(s)
- Sachiko Wakita
  - Department of Pathology and Matrix Biology, Mie University Graduate School of Medicine, Mie, Japan
- Mari Hara
  - Department of Pathology and Matrix Biology, Mie University Graduate School of Medicine, Mie, Japan
- Yasuji Kitabatake
  - Department of Pediatrics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- Keiji Kawatani
  - Department of Pediatrics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
  - Department of Neuroscience, Mayo Clinic, Scottsdale, AZ, USA
- Hiroki Kurahashi
  - Division of Molecular Genetics, Institute for Comprehensive Medical Science, Fujita Health University, Toyoake, Japan
- Ryotaro Hashizume
  - Department of Pathology and Matrix Biology, Mie University Graduate School of Medicine, Mie, Japan
  - Department of Genomic Medicine, Mie University Hospital, Mie, Japan
12
Yu Y, Chen L, Miao X, Li SC. SpecHap: a diploid phasing algorithm based on spectral graph theory. Nucleic Acids Res 2021; 49:e114. [PMID: 34403470 PMCID: PMC8565328 DOI: 10.1093/nar/gkab709] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/16/2020] [Revised: 07/25/2021] [Accepted: 08/02/2021] [Indexed: 11/30/2022]
Abstract
Haplotype phasing plays an important role in understanding the genetic data of diploid eukaryotic organisms. Different sequencing technologies (such as next-generation sequencing or third-generation sequencing) produce various genetic data that require haplotype assembly. Although multiple diploid haplotype phasing algorithms exist, only a few will work equally well across all sequencing technologies. In this work, we propose SpecHap, a novel haplotype assembly tool that leverages spectral graph theory. On both in silico and whole-genome sequencing datasets, SpecHap consumed less memory and required less CPU time, yet achieved comparable accuracy with state-of-the-art methods across all the test instances, which comprise sequencing data from next-generation sequencing, linked-reads, high-throughput chromosome conformation capture, PacBio single-molecule real-time, and Oxford Nanopore long-reads. Furthermore, SpecHap successfully phased an individual Ambystoma mexicanum, a species with gigantic diploid genomes, within 6 CPU hours and 945 MB peak memory usage, while other tools failed to yield results either due to memory overflow (40 GB) or time limit exceeded (5 days). Our results demonstrated that SpecHap is scalable, efficient, and accurate for diploid phasing across many sequencing platforms.
Collapse
Affiliation(s)
- Yonghan Yu
  - Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077, China
- Lingxi Chen
  - Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077, China
- Xinyao Miao
  - Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077, China
- Shuai Cheng Li
  - Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077, China
|
13
|
Ahmed AA, Ad'hiah AH. Interleukin-37 gene polymorphism and susceptibility to coronavirus disease 19 among Iraqi patients. Meta Gene 2021; 31:100989. [PMID: 34729360 PMCID: PMC8553418 DOI: 10.1016/j.mgene.2021.100989] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/30/2021] [Revised: 10/14/2021] [Accepted: 10/21/2021] [Indexed: 12/12/2022]
Abstract
Coronavirus disease 19 (COVID-19) is a highly contagious respiratory viral infection. Dysregulated immune response is an important feature of the disease, and cytokines are among the most important mediators of dysregulated immunity. Interleukin-37 (IL-37) is one such cytokine and studies have indicated its role in the pathogenesis of COVID-19. However, IL37 gene polymorphisms have not been identified in patients with COVID-19. Therefore, this case-control study (100 patients and 100 controls) was performed to understand the role of six single nucleotide polymorphisms of the IL37 gene (SNPs: rs3811042, rs3811043, rs2466449, rs3811045, rs3811046 and rs3811047) in susceptibility to COVID-19 among cases with severe disease. These polymorphisms were identified by Sanger DNA sequencing. Results revealed that the TG genotype of rs3811046 showed a significantly increased frequency in patients compared to controls (61.0 vs. 38.0%; odds ratio [OR] = 2.55; 95% confidence interval [CI] = 1.45–4.50; probability [p] = 0.002; corrected p [pc] = 0.01). The GA genotype of rs3811047 also showed an increased frequency in patients but the pc-value was not significant (39.0 vs. 24.0%; OR = 2.02; 95% CI = 1.10–3.71; p = 0.033; pc = 0.165). Haplotype analysis revealed a significantly increased frequency of the haplotype G-C-A-T-T-A (in the order: rs3811042, rs3811043, rs2466449, rs3811045, rs3811046 and rs3811047) in COVID-19 patients compared to controls (0.055 vs. 0.006; OR = 10.23; 95% CI = 1.53–68.14; p = 0.003; pc = 0.03). In conclusion, the study indicated that two variants of the IL37 gene (rs3811046 and rs3811047) may be associated with susceptibility to COVID-19 in the Iraqi population.
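The reported odds ratios can be reproduced from the genotype frequencies and the stated group sizes (100 patients, 100 controls). The sketch below is only an illustration: the function name `odds_ratio_ci` is hypothetical, and the Woolf (log-normal) confidence interval is an assumption, since the abstract does not say how the interval was computed, so the bounds may differ slightly from the published ones.

```python
import math


def odds_ratio_ci(case_pos, case_total, ctrl_pos, ctrl_total, z=1.96):
    """Odds ratio with a Woolf-type confidence interval (assumed method)."""
    a, b = case_pos, case_total - case_pos   # cases with / without the genotype
    c, d = ctrl_pos, ctrl_total - ctrl_pos   # controls with / without the genotype
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# rs3811046 TG genotype: 61 of 100 patients vs. 38 of 100 controls.
or_tg, lo_tg, hi_tg = odds_ratio_ci(61, 100, 38, 100)
# rs3811047 GA genotype: 39 of 100 patients vs. 24 of 100 controls.
or_ga, lo_ga, hi_ga = odds_ratio_ci(39, 100, 24, 100)
print(round(or_tg, 2), round(or_ga, 2))
```

Rounded to two decimals, the two point estimates agree with the odds ratios quoted in the abstract (2.55 and 2.02).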
Affiliation(s)
- Aeshah A Ahmed
  - Biotechnology Department, College of Science, University of Baghdad, Baghdad, Iraq
- Ali H Ad'hiah
  - Tropical-Biological Research Unit, College of Science, University of Baghdad, Baghdad, Iraq
|
14
|
Shafin K, Pesout T, Chang PC, Nattestad M, Kolesnikov A, Goel S, Baid G, Kolmogorov M, Eizenga JM, Miga KH, Carnevali P, Jain M, Carroll A, Paten B. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat Methods 2021; 18:1322-1332. [PMID: 34725481 PMCID: PMC8571015 DOI: 10.1038/s41592-021-01299-w] [Citation(s) in RCA: 106] [Impact Index Per Article: 35.3] [Received: 03/08/2021] [Accepted: 09/06/2021] [Indexed: 01/15/2023]
Abstract
Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).
Affiliation(s)
- Trevor Pesout
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Miten Jain
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
|
15
|
Luo X, Kang X, Schönhuth A. phasebook: haplotype-aware de novo assembly of diploid genomes from long reads. Genome Biol 2021; 22:299. [PMID: 34706745 PMCID: PMC8549298 DOI: 10.1186/s13059-021-02512-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/22/2021] [Accepted: 10/05/2021] [Indexed: 01/27/2023]
Abstract
Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly. However, current long-read assemblers are either reference based, so introduce biases, or fail to capture the haplotype diversity of diploid genomes. We present phasebook, a de novo approach for reconstructing the haplotypes of diploid genomes from long reads. phasebook outperforms other approaches in terms of haplotype coverage by large margins, in addition to achieving competitive performance in terms of assembly errors and assembly contiguity.
Affiliation(s)
- Xiao Luo
  - Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Xiongbin Kang
  - Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Alexander Schönhuth
  - Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands.
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
|
16
|
Srivastava K, Fratzscher AS, Lan B, Flegel WA. Cataloguing experimentally confirmed 80.7 kb-long ACKR1 haplotypes from the 1000 Genomes Project database. BMC Bioinformatics 2021; 22:273. [PMID: 34039276 PMCID: PMC8150616 DOI: 10.1186/s12859-021-04169-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/18/2020] [Accepted: 05/04/2021] [Indexed: 12/18/2022]
Abstract
Background: Clinically effective and safe genotyping relies on correct reference sequences, often represented by haplotypes. The 1000 Genomes Project recorded individual genotypes across 26 different populations and, using computerized genotype phasing, reported haplotype data. In contrast, we identified long reference sequences by analyzing the homozygous genomic regions in this online database, a concept that has rarely been reported since next generation sequencing data became available. Study design and methods: Phased genotype data for an 80.6 kb region of chromosome 1 were downloaded for all 2,504 unrelated individuals of the 1000 Genomes Project Phase 3 cohort. The data were centered on the ACKR1 gene and bordered by the CADM3 and FCER1A genes. Individuals with heterozygosity at a single site or with complete homozygosity allowed unambiguous assignment of an ACKR1 haplotype. A computer algorithm was developed for extracting these haplotypes from the 1000 Genomes Project in an automated fashion. A manual analysis validated the data extracted by the algorithm. Results: We confirmed 902 ACKR1 haplotypes of varying lengths, the longest at 80,584 nucleotides and the shortest at 1,901 nucleotides. The combined length of haplotype sequences comprised 19,895,388 nucleotides with a median of 16,014 nucleotides. Based on our approach, all haplotypes can be considered experimentally confirmed and not affected by the known errors of computerized genotype phasing. Conclusions: Tracts of homozygosity can provide definitive reference sequences for any gene. They are particularly useful when observed in unrelated individuals of large scale sequence databases. As a proof of principle, we explored the 1000 Genomes Project database for ACKR1 gene data and mined long haplotypes. These haplotypes are useful for high throughput analysis with next generation sequencing. Our approach is scalable, using automated bioinformatics tools, and can be applied to any gene.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04169-6.
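The selection rule in the abstract (at most one heterozygous site means the haplotypes are unambiguous) can be sketched in a few lines. This is an illustration only, not the authors' algorithm; the function name `resolve_haplotypes` and the input layout (one pair of alleles per site) are assumptions.

```python
def resolve_haplotypes(genotypes):
    """Return the two haplotypes if phase is unambiguous, else None.

    `genotypes` is a list of (allele_1, allele_2) pairs, one per site.
    With zero or one heterozygous site there is only one way to split
    the alleles into two haplotypes, so no statistical phasing is needed.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    if len(het_sites) > 1:
        return None  # two or more heterozygous sites: phase is ambiguous
    hap_1 = "".join(a for a, _ in genotypes)
    hap_2 = "".join(b for _, b in genotypes)
    return hap_1, hap_2


homozygous = [("A", "A"), ("C", "C"), ("G", "G")]
single_het = [("A", "A"), ("C", "T"), ("G", "G")]
double_het = [("A", "G"), ("C", "T"), ("G", "G")]

print(resolve_haplotypes(homozygous))
print(resolve_haplotypes(single_het))
print(resolve_haplotypes(double_het))
```

A fully homozygous individual yields two identical haplotypes, a single heterozygous site yields two haplotypes that differ only at that site, and anything more is rejected rather than guessed.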
Affiliation(s)
- Kshitij Srivastava
  - Laboratory Services Section, Department of Transfusion Medicine, NIH Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
- Anne-Sophie Fratzscher
  - Laboratory Services Section, Department of Transfusion Medicine, NIH Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
- Bo Lan
  - Laboratory Services Section, Department of Transfusion Medicine, NIH Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
- Willy Albert Flegel
  - Laboratory Services Section, Department of Transfusion Medicine, NIH Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA.
|
17
|
Li R, Qu H, Chen J, Wang S, Chater JM, Zhang L, Wei J, Zhang YM, Xu C, Zhong WD, Zhu J, Lu J, Feng Y, Chen W, Ma R, Ferrante SP, Roose ML, Jia Z. Inference of Chromosome-Length Haplotypes Using Genomic Data of Three or a Few More Single Gametes. Mol Biol Evol 2021; 37:3684-3698. [PMID: 32668004 PMCID: PMC7743722 DOI: 10.1093/molbev/msaa176] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/26/2022]
Abstract
Compared with genomic data of individual markers, haplotype data provide higher resolution for DNA variants, advancing our knowledge in genetics and evolution. Although many computational and experimental phasing methods have been developed for analyzing diploid genomes, it remains challenging to reconstruct chromosome-scale haplotypes at low cost, which constrains the utility of this valuable genetic resource. Gamete cells, the natural packaging of haploid complements, are ideal materials for phasing entire chromosomes because the majority of the haplotypic allele combinations has been preserved. Therefore, compared with the current diploid-based phasing methods, using haploid genomic data of single gametes may substantially reduce the complexity in inferring the donor’s chromosomal haplotypes. In this study, we developed the first easy-to-use R package, Hapi, for inferring chromosome-length haplotypes of individual diploid genomes with only a few gametes. Hapi outperformed other phasing methods when analyzing both simulated and real single gamete cell sequencing data sets. The results also suggested that chromosome-scale haplotypes may be inferred by using as few as three gametes, which has pushed the boundary to its possible limit. The single gamete cell sequencing technology allied with the cost-effective Hapi method will make large-scale haplotype-based genetic studies feasible and affordable, promoting the use of haplotype data in a wide range of research.
Affiliation(s)
- Ruidong Li
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Graduate Program in Genetics, Genomics, and Bioinformatics, University of California, Riverside, Riverside, CA
- Han Qu
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA
- Jinfeng Chen
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA
- Shibo Wang
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA
- John M Chater
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA
- Le Zhang
  - Graduate Program in Genetics, Genomics, and Bioinformatics, University of California, Riverside, Riverside, CA
- Julong Wei
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI
- Yuan-Ming Zhang
  - Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
- Chenwu Xu
  - Jiangsu Provincial Key Laboratory of Crop Genetics and Physiology, Co-Innovation Center for Modern Production Technology of Grain Crops, Key Laboratory of Plant Functional Genomics of Ministry of Education, Yangzhou University, Yangzhou, China
- Wei-De Zhong
  - Department of Urology, Guangdong Key Laboratory of Clinical Molecular Medicine and Diagnostics, Guangzhou First People's Hospital, School of Medicine, South China University of Technology, Guangzhou, China
- Jianguo Zhu
  - Department of Urology, Guizhou Provincial People's Hospital, Guizhou, China
- Jianming Lu
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Department of Urology, Guangdong Key Laboratory of Clinical Molecular Medicine and Diagnostics, Guangzhou First People's Hospital, School of Medicine, South China University of Technology, Guangzhou, China
- Yuanfa Feng
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Department of Urology, Guangdong Key Laboratory of Clinical Molecular Medicine and Diagnostics, Guangzhou First People's Hospital, School of Medicine, South China University of Technology, Guangzhou, China
- Weiming Chen
  - Department of Urology, Guizhou Provincial People's Hospital, Guizhou, China
- Renyuan Ma
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Department of Mathematics, Bowdoin College, Brunswick, ME
- Sergio Pietro Ferrante
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA
- Mikeal L Roose
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Graduate Program in Genetics, Genomics, and Bioinformatics, University of California, Riverside, Riverside, CA
- Zhenyu Jia
  - Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA; Graduate Program in Genetics, Genomics, and Bioinformatics, University of California, Riverside, Riverside, CA
|
18
|
Garg S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol 2021; 22:101. [PMID: 33845884 PMCID: PMC8040228 DOI: 10.1186/s13059-021-02328-9] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 01/10/2021] [Accepted: 03/25/2021] [Indexed: 12/13/2022]
Abstract
High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short read sequencing does not yield haplotype information spanning whole chromosomes directly. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review and discuss recent methodological progress and perspectives in these areas.
Affiliation(s)
- Shilpa Garg
  - Department of Biology, University of Copenhagen, Copenhagen, Denmark.
|
19
|
A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model. PLoS One 2020; 15:e0241291. [PMID: 33120403 PMCID: PMC7595403 DOI: 10.1371/journal.pone.0241291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/03/2020] [Accepted: 10/12/2020] [Indexed: 12/30/2022]
Abstract
The decreasing cost of high-throughput DNA sequencing technologies provides a huge amount of data that enables researchers to determine haplotypes for diploid and polyploid organisms. Although various methods have been developed to reconstruct haplotypes in diploid form, achieving high accuracy remains challenging. Also, most current methods cannot be applied to polyploid form. In this paper, an iterative method is proposed that employs a hypergraph to reconstruct haplotypes. By utilizing a chaotic viewpoint, the proposed method can enhance the obtained haplotypes. For this purpose, a haplotype set was randomly generated as an initial estimate, and its consistency with the input fragments was described by constructing a weighted hypergraph. Partitioning the hypergraph specifies those positions in the haplotype set that need to be corrected. This procedure is repeated until no further improvement can be achieved. Each element of the finalized haplotype set is mapped to a line by chaos game representation, and a coordinate series is defined based on the position of mapped points. Then, positions with low quality can be assessed by applying a local projection. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.
|
20
|
Inostroza MGP, González FJN, Landi V, Jurado JML, Bermejo JVD, Fernández Álvarez J, Martínez Martínez MDA. Bayesian Analysis of the Association between Casein Complex Haplotype Variants and Milk Yield, Composition, and Curve Shape Parameters in Murciano-Granadina Goats. Animals (Basel) 2020; 10:E1845. [PMID: 33050522 PMCID: PMC7600415 DOI: 10.3390/ani10101845] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/18/2020] [Revised: 10/08/2020] [Accepted: 10/08/2020] [Indexed: 01/05/2023]
Abstract
Considering casein haplotype variants rather than SNPs may maximize the understanding of heritable mechanisms and their implication on the expression of functional traits related to milk production. Effects of casein complex haplotypes on milk yield, milk composition, and curve shape parameters were evaluated using Bayesian inference for ANOVA. We identified 48 single nucleotide polymorphisms (SNPs) present in the casein complex of 159 unrelated individuals of diverse ancestry, which were organized into 86 haplotypes. The Ali and Schaeffer model was chosen as the best fitting model for milk yield (kg), protein, fat, dry matter, and lactose (%), while parabolic yield-density was chosen as the best fitting model for somatic cell count (SCC × 10³ sc/mL). Peak and persistence for all traits were computed respectively. Statistically significant differences (p < 0.05) were found for milk yield and components. However, no significant difference was found for any curve shape parameter except for protein percentage peak. Those haplotypes for which higher milk yields were reported were the ones that had higher percentages for protein, fat, dry matter, and lactose, while the opposite trend was described by somatic cell counts. In conclusion, casein complex haplotypes can be considered in selection strategies for economically important traits in dairy goats.
Affiliation(s)
- María Gabriela Pizarro Inostroza
  - Department of Genetics, Faculty of Veterinary Sciences, University of Córdoba, 14071 Córdoba, Spain
  - Animal Breeding Consulting, S.L., Córdoba Science and Technology Park Rabanales 21, 14071 Córdoba, Spain
- Francisco Javier Navas González
  - Department of Genetics, Faculty of Veterinary Sciences, University of Córdoba, 14071 Córdoba, Spain
- Vincenzo Landi
  - Department of Veterinary Medicine, University of Bari “Aldo Moro”, 70010 Valenzano, Italy
- Jose Manuel León Jurado
  - Centro Agropecuario Provincial de Córdoba, Diputación Provincial de Córdoba, Córdoba, 14071 Córdoba, Spain
- Juan Vicente Delgado Bermejo
  - Department of Genetics, Faculty of Veterinary Sciences, University of Córdoba, 14071 Córdoba, Spain
- Javier Fernández Álvarez
  - National Association of Breeders of Murciano-Granadina Goat Breed, Fuente Vaqueros, 18340 Granada, Spain
- María del Amparo Martínez Martínez
  - Department of Genetics, Faculty of Veterinary Sciences, University of Córdoba, 14071 Córdoba, Spain
|
21
|
Baaijens JA, Schönhuth A. Overlap graph-based generation of haplotigs for diploids and polyploids. Bioinformatics 2020; 35:4281-4289. [PMID: 30994902 DOI: 10.1093/bioinformatics/btz255] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/26/2018] [Revised: 03/18/2019] [Accepted: 04/11/2019] [Indexed: 01/05/2023]
Abstract
Motivation: Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity. Results: We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. Availability and implementation: POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexander Schönhuth
  - Centrum Wiskunde & Informatica, XG Amsterdam, The Netherlands; Theoretical Biology and Bioinformatics, Utrecht University, CH Utrecht, The Netherlands
|
22
|
Noninvasive prenatal diagnosis of hemophilia A by a haplotype-based approach using cell-free fetal DNA. Biotechniques 2020; 68:117-121. [PMID: 31996009 DOI: 10.2144/btn-2019-0113] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022]
Abstract
Aim: We aimed to demonstrate noninvasive prenatal diagnosis (NIPD) of hemophilia A (HA) using a haplotype-based approach. Methods: Two families at risk for HA were recruited for this study. First, maternal haplotypes associated with pathogenic variants were constructed using the genotypes of the mothers and probands. Then, fetal haplotypes were deduced using a maternal haplotype-assisted hidden Markov model. Finally, the NIPD results were further confirmed by invasive prenatal diagnosis. Results: Two fetal genotypes were successfully inferred, with one normal fetus and one carrier fetus. The NIPD results were confirmed by invasive prenatal diagnosis, with a 100% consistency rate. Conclusion: Our test has been shown to be accurate and reliable. With further validation in a large patient cohort, this haplotype-based approach could be feasible for the NIPD of HA and other X-linked single-gene disorders.
|
23
|
Ando A, Imaeda N, Matsubara T, Takasu M, Miyamoto A, Oshima S, Nishii N, Kametani Y, Shiina T, Kulski JK, Kitagawa H. Genetic Association between Swine Leukocyte Antigen Class II Haplotypes and Reproduction Traits in Microminipigs. Cells 2019; 8:cells8080783. [PMID: 31357541 PMCID: PMC6721486 DOI: 10.3390/cells8080783] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 05/28/2019] [Revised: 07/16/2019] [Accepted: 07/22/2019] [Indexed: 02/06/2023]
Abstract
The effects of swine leukocyte antigen (SLA) molecules on numerous production and reproduction performance traits have been mainly reported as associations with specific SLA haplotypes that were assigned using serological typing methods. In this study, we intended to clarify the association between SLA class II genes and reproductive traits in a highly inbred population of 187 Microminipigs (MMPs), which have eight different types of SLA class II haplotypes. In doing so, we compared reproductive performance measures, such as fertility index, gestation period, litter size, and number of stillbirths, among SLA class II low resolution haplotypes (Lrs) that were assigned by a polymerase chain reaction-sequence specific primers (PCR-SSP) typing method. Only low resolution haplotypes were used in this study because the eight SLA class II high-resolution haplotypes had been assigned to the 14 parents or the progenitors of the highly inbred MMP herd in a previous publication. The fertility index of dams with Lr-0.13 was significantly lower than that of dams with Lr-0.16, Lr-0.17, Lr-0.18, or Lr-0.37. Dams with Lr-0.23 had significantly smaller litter size at birth than those with Lr-0.17, Lr-0.18, or Lr-0.37. Furthermore, litter size at weaning of dams with Lr-0.23 was also significantly smaller than that of dams with Lr-0.16, Lr-0.17, Lr-0.18, or Lr-0.37. The small litter size of dams with Lr-0.23 correlated with the smaller body sizes of these MMPs. These results suggest that SLA class II haplotypes are useful differential genetic markers for further haplotypic and epistatic studies of reproductive traits, selective breeding programs, and improvements in the production and reproduction performances of MMPs.
Affiliation(s)
- Asako Ando
  - Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara 259-1193, Japan
- Noriaki Imaeda
  - Department of Veterinary Medicine, Faculty of Applied Biological Sciences, Gifu University, Gifu 501-1193, Japan
- Tatsuya Matsubara
  - Department of Veterinary Medicine, Faculty of Applied Biological Sciences, Gifu University, Gifu 501-1193, Japan
- Masaki Takasu
  - Department of Veterinary Medicine, Faculty of Applied Biological Sciences, Gifu University, Gifu 501-1193, Japan
- Asuka Miyamoto
  - Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara 259-1193, Japan
- Shino Oshima
  - Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara 259-1193, Japan
- Naohito Nishii
  - Department of Veterinary Medicine, Faculty of Applied Biological Sciences, Gifu University, Gifu 501-1193, Japan
- Yoshie Kametani
  - Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara 259-1193, Japan
- Takashi Shiina
  - Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara 259-1193, Japan
- Jerzy K Kulski
  - Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara 259-1193, Japan
  - Faculty of Health and Medical Sciences, UWA Medical School, The University of Western Australia, Crawley, WA, 6009, Australia
- Hitoshi Kitagawa
  - Laboratory of Veterinary Internal Medicine, Faculty of Veterinary Medicine, Okayama University of Science, 1-3 Ikoino-oka, Imabari, Ehime 794-8555, Japan.
|
24
|
Olyaee MH, Khanteymoori A, Khalifeh K. Application of Chaotic Laws to Improve Haplotype Assembly Using Chaos Game Representation. Sci Rep 2019; 9:10361. [PMID: 31316124 PMCID: PMC6637069 DOI: 10.1038/s41598-019-46844-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/04/2018] [Accepted: 07/01/2019] [Indexed: 02/06/2023]
Abstract
Sequence data are deposited in the form of unphased genotypes and it is not possible to directly identify the location of a particular allele on a specific parental chromosome or haplotype. This study employed nonlinear time series modeling approaches to analyze the haplotype sequences obtained from the NGS sequencing method. To evaluate the chaotic behavior of haplotypes, we analyzed their whole sequences, as well as several subsequences from distinct haplotypes, in terms of the SNP distribution on their chromosomes. This analysis utilized chaos game representation (CGR) followed by the application of two different scaling methods. It was found that chaotic behavior clearly exists in most haplotype subsequences. For testing the applicability of the proposed model, the present research determined the alleles in gap positions and positions with low coverage by using chromosome subsequences in which 10% of each subsequence's alleles are replaced by gaps. After conversion of the subsequences' CGR into the coordinate series, a Local Projection (LP) method predicted the measure of ambiguous positions in the coordinate series. It was discovered that the average reconstruction rate for all input data is more than 97%, demonstrating that applying this knowledge can effectively improve the reconstruction rate of given haplotypes.
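For readers unfamiliar with chaos game representation (CGR), the following minimal sketch shows the standard construction for a nucleotide string: each base pulls the current point halfway toward its assigned corner of the unit square. The corner assignment, the starting point, and the function name `cgr_points` are assumptions for illustration; the abstract does not specify the encoding or the scaling methods the authors used.

```python
# Assumed corner layout for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR coordinate series for a nucleotide sequence."""
    x, y = start
    points = []
    for base in sequence.upper():
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


series = cgr_points("AG")
print(series)
```

Because every point encodes the whole prefix of the sequence, the resulting coordinate series can be treated as a time series, which is the kind of input a nonlinear method such as local projection operates on.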
Affiliation(s)
- Khosrow Khalifeh
  - Department of Biology, Faculty of Sciences, University of Zanjan, Zanjan, Iran
| |
Collapse
|
25
|
Ebler J, Haukness M, Pesout T, Marschall T, Paten B. Haplotype-aware diplotyping from noisy long reads. Genome Biol 2019; 20:116. [PMID: 31159868 PMCID: PMC6547545 DOI: 10.1186/s13059-019-1709-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 07/25/2018] [Accepted: 05/06/2019] [Indexed: 12/19/2022]
Abstract
Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.
Affiliation(s)
- Jana Ebler
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany
- Marina Haukness
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
- Trevor Pesout
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
- Tobias Marschall
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
26
Tian S, Yan H, Klee EW, Kalmbach M, Slager SL. Comparative analysis of de novo assemblers for variation discovery in personal genomes. Brief Bioinform 2019; 19:893-904. [PMID: 28407084 PMCID: PMC6169673 DOI: 10.1093/bib/bbx037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/12/2016] [Accepted: 03/08/2017] [Indexed: 12/30/2022]
Abstract
Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations and provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly and whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM together with three haplotype- or local de novo assembly-based callers. SGVar outperforms the other assemblers in sensitivity and tolerance of sequencing errors. We recapitulated the findings on whole-genome and exome data from a Utah residents with Northern and Western European ancestry (CEU) trio, showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar is capable of resolving long-range phase and identifying large INDELs from WES, more prominently from WGS. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.
Affiliation(s)
- Shulan Tian
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Huihuang Yan
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Eric W Klee
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
  - Center for Individualized Medicine Bioinformatics Program, Mayo Clinic, USA
- Michael Kalmbach
  - Division of Information Management and Analytics, Department of Information Technology, Mayo Clinic, USA
- Susan L Slager
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
27
Motazedi E, Finkers R, Maliepaard C, de Ridder D. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Brief Bioinform 2019; 19:387-403. [PMID: 28065918 DOI: 10.1093/bib/bbw126] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 09/08/2016] [Indexed: 11/12/2022]
Abstract
Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.
Affiliation(s)
- Ehsan Motazedi
  - Bioinformatics Group, Wageningen University and Research, The Netherlands
  - Wageningen UR Plant Breeding, The Netherlands
- Dick de Ridder
  - Bioinformatics Group, Wageningen University and Research, The Netherlands
28
Abstract
Affordable, high-throughput DNA sequencing has accelerated the pace of genome assembly over the past decade. Genome assemblies from high-throughput, short-read sequencing, however, are often not as contiguous as the first generation of genome assemblies. Whereas early genome assembly projects were often aided by clone maps or other mapping data, many current assembly projects forego these scaffolding data and only assemble genomes into smaller segments. Recently, new technologies have been invented that allow chromosome-scale assembly at a lower cost and faster speed than traditional methods. Here, we give an overview of the problem of chromosome-scale assembly and traditional methods for tackling this problem. We then review new technologies for chromosome-scale assembly and recent genome projects that used these technologies to create highly contiguous genome assemblies at low cost.
Affiliation(s)
- Edward S. Rice
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Richard E. Green
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
  - Dovetail Genomics, LLC, Santa Cruz, California 95060, USA
29
Lin YY, Wu PC, Chen PL, Oyang YJ, Chen CY. HAHap: a read-based haplotyping method using hierarchical assembly. PeerJ 2018; 6:e5852. [PMID: 30397550 PMCID: PMC6214236 DOI: 10.7717/peerj.5852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2018] [Accepted: 09/27/2018] [Indexed: 11/20/2022]
Abstract
BACKGROUND The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary trend to resolve haplotypes by reducing conflicts in a single nucleotide polymorphism-fragment matrix. However, it is frequently observed that the solution with the optimal MEC might not be the real haplotypes, because MEC methods consider all positions together and conflicts in noisy regions can mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. RESULTS This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confidence variant pairs to build haplotypes progressively. The phasing results by HAHap on both real and simulated data, compared to other MEC-based methods, revealed better phasing error rates for constructing haplotypes using short reads from whole-genome sequencing. We compared the number of error corrections (ECs) on real data with other methods, which revealed the ability of HAHap to predict haplotypes with a lower number of ECs. We also used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting the applicability of HAHap in certain situations.
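To make the MEC objective concrete, the sketch below scores one candidate diploid haplotype against a small SNP-fragment matrix. It is a generic illustration of the MEC score, not HAHap's hierarchical assembly; the names `mismatches` and `mec_score`, the 0/1 allele encoding, and the use of `None` for uncovered sites are assumptions of this example.

```python
from typing import List, Optional, Sequence

# One fragment (read) is a row of the SNP-fragment matrix:
# 0 or 1 for the observed allele, None where the read does not cover the SNP.
Fragment = Sequence[Optional[int]]


def mismatches(fragment: Fragment, haplotype: Sequence[int]) -> int:
    """Count covered positions where the fragment disagrees with the haplotype."""
    return sum(
        1
        for allele, expected in zip(fragment, haplotype)
        if allele is not None and allele != expected
    )


def mec_score(fragments: List[Fragment], hap1: Sequence[int]) -> int:
    """MEC score of a diploid phasing at heterozygous sites.

    The second haplotype is taken to be the complement of the first, and each
    fragment is charged the number of corrections needed to match its closer
    haplotype.
    """
    hap2 = [1 - allele for allele in hap1]
    return sum(
        min(mismatches(fragment, hap1), mismatches(fragment, hap2))
        for fragment in fragments
    )


if __name__ == "__main__":
    reads: List[Fragment] = [
        [0, 0, None],
        [None, 1, 1],
        [1, 1, None],
        [0, None, 1],
    ]
    print(mec_score(reads, [0, 0, 0]))
    print(mec_score(reads, [0, 1, 0]))
```

An MEC-based phaser searches for the haplotype that minimises this score. The concern raised in the abstract is that the minimiser can still be wrong when a noisy region contributes many of the conflicts.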
Affiliation(s)
- Yu-Yu Lin
  - Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Ping Chun Wu
  - Taipei Blood Center, Taiwan Blood Services Foundation, Taipei, Taiwan
- Pei-Lung Chen
  - Graduate Institute of Medical Genomics and Proteomics, College of Medicine, National Taiwan University, Taipei, Taiwan
- Yen-Jen Oyang
  - Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Chien-Yu Chen
  - Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan
30
Beretta S, Patterson MD, Zaccaria S, Della Vedova G, Bonizzoni P. HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads. BMC Bioinformatics 2018; 19:252. [PMID: 29970002 PMCID: PMC6029272 DOI: 10.1186/s12859-018-2253-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/15/2017] [Accepted: 06/18/2018] [Indexed: 01/08/2023]
Abstract
Background Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes owing to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Results Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, making it possible to consider datasets with higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Conclusions Our method leverages the long-range information of sequencing reads, which yields assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result.
Availability HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).
Affiliation(s)
- Stefano Beretta
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
- Murray D Patterson
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
- Simone Zaccaria
  - Department of Computer Science, Princeton University, Princeton, New Jersey, USA
- Gianluca Della Vedova
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
- Paola Bonizzoni
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
31
Garg S, Rautiainen M, Novak AM, Garrison E, Durbin R, Marschall T. A graph-based approach to diploid genome assembly. Bioinformatics 2018; 34:i105-i114. [PMID: 29949989 PMCID: PMC6022571 DOI: 10.1093/bioinformatics/bty279] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Indexed: 11/14/2022]
Abstract
Motivation Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation https://github.com/whatshap/whatshap. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shilpa Garg
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, Germany
  - Department of Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
  - Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
- Mikko Rautiainen
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, Germany
  - Department of Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
  - Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
- Adam M Novak
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Erik Garrison
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
  - Department of Genetics, University of Cambridge, Cambridge, UK
- Richard Durbin
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
  - Department of Genetics, University of Cambridge, Cambridge, UK
- Tobias Marschall
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, Germany
  - Department of Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
32
Abstract
Motivation Current technologies for single-cell DNA sequencing require whole-genome amplification (WGA), as a single cell contains too little DNA for direct sequencing. Unfortunately, WGA introduces biases in the resulting sequencing data, including non-uniformity in genome coverage and high rates of allele dropout. These biases complicate many downstream analyses, including the detection of genomic variants. Results We show that amplification biases have a potential upside: long-range correlations in rates of allele dropout provide a signal for phasing haplotypes at the lengths of amplicons from WGA, lengths which are generally longer than individual sequence reads. We describe a statistical test to measure concurrent allele dropout between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. We use results of this test to perform haplotype assembly across a collection of single cells. We demonstrate that the algorithm predicts phasing between pairs of SNPs with higher accuracy than phasing from reads alone. Using whole-genome sequencing data from only seven neural cells, we obtain haplotype blocks that are orders of magnitude longer than with sequence reads alone (median length 10.2 kb versus 312 bp), with error rates <2%. We demonstrate similar advantages on whole-exome data from 16 cells, where we obtain haplotype blocks with median length 9.2 kb (comparable to typical gene lengths), compared with median lengths of 41 bp with sequence reads alone, with error rates <4%. Our algorithm will be useful for haplotyping of rare alleles and studies of allele-specific somatic aberrations. Availability and implementation Source code is available at https://www.github.com/raphael-group. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gryte Satas
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - Department of Computer Science, Brown University, Providence, RI, USA
- Benjamin J Raphael
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
33
Abstract
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
34
Porubsky D, Garg S, Sanders AD, Korbel JO, Guryev V, Lansdorp PM, Marschall T. Dense and accurate whole-chromosome haplotyping of individual genomes. Nat Commun 2017; 8:1293. [PMID: 29101320 PMCID: PMC5670131 DOI: 10.1038/s41467-017-01389-4] [Citation(s) in RCA: 65] [Impact Index Per Article: 9.3] [Received: 03/20/2017] [Accepted: 09/14/2017] [Indexed: 12/15/2022]
Abstract
The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.
Affiliation(s)
- David Porubsky
  - European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Building 3226, 9713 AV, Groningen, The Netherlands
  - Max Planck Institute for Informatics, Saarbrücken, Germany
- Shilpa Garg
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, 66123, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, 66123, Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, 66123, Saarbrücken, Germany
- Ashley D Sanders
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstraße 1, 69117, Heidelberg, Germany
  - Terry Fox Laboratory, BC Cancer Agency, 601 West 10th Avenue, Vancouver, BC, V5Z 1L3, Canada
- Jan O Korbel
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Victor Guryev
  - European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Building 3226, 9713 AV, Groningen, The Netherlands
- Peter M Lansdorp
  - European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Building 3226, 9713 AV, Groningen, The Netherlands
  - Terry Fox Laboratory, BC Cancer Agency, 601 West 10th Avenue, Vancouver, BC, V5Z 1L3, Canada
  - Department of Medical Genetics, University of British Columbia, 2350 Health Science Mall, Vancouver, BC, V6T 1Z3, Canada
- Tobias Marschall
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, 66123, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, 66123, Saarbrücken, Germany
35
Huang M, Tu J, Lu Z. Recent Advances in Experimental Whole Genome Haplotyping Methods. Int J Mol Sci 2017; 18:E1944. [PMID: 28891974 PMCID: PMC5618593 DOI: 10.3390/ijms18091944] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/03/2017] [Revised: 09/01/2017] [Accepted: 09/05/2017] [Indexed: 01/06/2023]
Abstract
Haplotypes play a vital role in diverse fields; however, sequencing technologies cannot resolve haplotypes directly. Pioneers demonstrated several approaches to resolve haplotypes in the early years, which have been extensively reviewed. Since then, numerous methods have been developed that have significantly improved phasing performance. Here, we review experimental methods that have emerged mainly over the past five years, and categorize them into five classes according to their maximum scale of contiguity: (i) encapsulation, (ii) 3D structure capture and construction, (iii) compartmentalization, (iv) fluorography, (v) long-read sequencing. Several subsections of certain methods are attached to each class as instances. We also discuss the relative advantages and disadvantages of different classes and make comparisons among representative methods of each class.
Affiliation(s)
- Mengting Huang
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Jing Tu
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Zuhong Lu
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
36
Abstract
MOTIVATION Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information (reads and pedigree) has the potential to deliver results better than each individually. RESULTS We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual. AVAILABILITY AND IMPLEMENTATION https://bitbucket.org/whatshap/whatshap CONTACT t.marschall@mpi-inf.mpg.de.
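The genetic-phasing half of this idea can be illustrated at a single biallelic site in a mother-father-child trio. The sketch below is a simplified, hypothetical example of applying Mendelian inheritance to genotypes only; it is not the fixed-parameter algorithm described in the abstract, and the function name `phase_child_by_transmission` and the 0/1 allele encoding are assumptions of this example.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, int]  # unordered pair of alleles, e.g. (0, 1)


def phase_child_by_transmission(
    mother: Genotype, father: Genotype, child: Genotype
) -> Optional[Tuple[int, int]]:
    """Return the child's (maternal allele, paternal allele) if Mendelian
    inheritance determines it uniquely, otherwise None.

    None covers both ambiguous sites (for example, all three individuals
    heterozygous) and sites that violate Mendelian inheritance.
    """
    candidates = {
        (maternal, paternal)
        for maternal in set(mother)
        for paternal in set(father)
        if sorted((maternal, paternal)) == sorted(child)
    }
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


if __name__ == "__main__":
    # Mother is homozygous 0/0, so the child's 0 allele must be maternal.
    print(phase_child_by_transmission((0, 0), (0, 1), (0, 1)))
    # All three heterozygous: genotypes alone cannot phase this site.
    print(phase_child_by_transmission((0, 1), (0, 1), (0, 1)))
```

Sites where this returns None are the ones where reads spanning neighbouring variants can supply the missing phase information, which is the motivation for combining the two sources.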
Affiliation(s)
- Shilpa Garg
  - Center for Bioinformatics, Saarland University, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarbrücken, Germany
  - Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
- Marcel Martin
  - Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, SE-17121 Solna, Sweden
- Tobias Marschall
  - Center for Bioinformatics, Saarland University, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarbrücken, Germany
37
Ben-Elazar S, Chor B, Yakhini Z. Extending partial haplotypes to full genome haplotypes using chromosome conformation capture data. Bioinformatics 2017; 32:i559-i566. [PMID: 27587675 DOI: 10.1093/bioinformatics/btw453] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 01/09/2023]
Abstract
MOTIVATION Complex interactions among alleles often drive differences in inherited properties including disease predisposition. Isolating the effects of these interactions requires phasing information that is difficult to measure or infer. Furthermore, prevalent sequencing technologies used in the essential first step of determining a haplotype limit the range of that step to the span of reads, namely hundreds of bases. With the advent of pseudo-long read technologies, observable partial haplotypes can span several orders of magnitude more. Yet, measuring whole-genome-single-individual haplotypes remains a challenge. A different view of whole genome measurement addresses the 3D structure of the genome, with great development of Hi-C techniques in recent years. A shortcoming of current Hi-C, however, is the difficulty in inferring information that is specific to each of a pair of homologous chromosomes. RESULTS In this work, we develop a robust algorithmic framework that takes two measurement-derived datasets, raw Hi-C and partial short-range haplotypes, and constructs the full-genome haplotype as well as phased diploid Hi-C maps. By analyzing both data sets together we thus bridge important gaps in both technologies: from short to long haplotypes and from un-phased to phased Hi-C. We demonstrate that our method can recover ground truth haplotypes with high accuracy, using measured biological data as well as simulated data. We analyze the impact of noise, Hi-C sequencing depth and measured haplotype lengths on performance. Finally, we use the inferred 3D structure of a human genome to point at nuclear co-localization of transcription factor targets. AVAILABILITY AND IMPLEMENTATION The implementation is available at https://github.com/YakhiniGroup/SpectraPh CONTACT zohar.yakhini@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shay Ben-Elazar
  - Department of Computer Science, Tel-Aviv University, Israel
  - Microsoft R&D, Herzliya, Israel
- Benny Chor
  - Department of Computer Science, Tel-Aviv University, Israel
- Zohar Yakhini
  - Agilent Laboratories, Tel-Aviv, Israel
  - Computer Science Department, Technion - Israel Institute of Technology, Haifa, Israel
  - School of Computer Science, Herzliya Interdisciplinary Center
38
Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res 2017; 27:757-767. [PMID: 28381613 PMCID: PMC5411770 DOI: 10.1101/gr.214874.116] [Citation(s) in RCA: 494] [Impact Index Per Article: 70.6] [Received: 08/18/2016] [Accepted: 03/10/2017] [Indexed: 01/17/2023]
Abstract
Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses, and in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed free of reference bias, but nearly all were constructed by merging homologous loci into single "consensus" sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing, and one using thousands of clone pools. Here, we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ∼1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new "pushbutton" algorithm, Supernova. Each computation took 2 d on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.
Affiliation(s)
- Vijay Kumar
- 10x Genomics, Pleasanton, California 94566, USA
- Preyas Shah
- 10x Genomics, Pleasanton, California 94566, USA
39
Abstract
A haplotype is a string of nucleotides or alleles at nearby loci on one chromosome, usually inherited as a unit. Within the major histocompatibility complex (MHC) region on human chromosome 6p, independent population studies of multiple families have identified conserved extended haplotypes (CEHs) that segregate as long stretches (≥1 megabase) of essentially identical DNA sequence at relatively high (≥0.5 %) population frequency ("genetic fixity"). CEHs were first identified through segregation analysis in the early 1980s. In European Caucasian populations, the most frequent 30 CEHs account for at least one-third of all MHC haplotypes. These CEHs provide all of the known individual MHC susceptibility and protective genetic markers within those populations for several complex genetic diseases. Haplotypes are rigorously determined directly by sequencing single chromosomes or by Mendelian segregation analysis using families with informative genotypes. Four parental haplotypes are assigned unambiguously using genotypes from the two parents and from two of their haploidentical (to each other) children. However, the most common current technique to phase haplotypes is probabilistic statistical imputation, using unrelated subjects. Such probabilistic techniques have failed to detect CEHs and are thus of questionable value in identifying long-range haplotype structure and, consequently, genetic structure-function relationships. Finally, with haplotypes rigorously defined, association studies can determine frequencies of alleles among unrelated patient haplotypes vs. those among only unaffected family members (i.e., control alleles/haplotypes). Such studies reduce, as much as possible, the confounding effects of population stratification common to all genetic studies.
Affiliation(s)
- Chester A Alper
- Program in Cellular and Molecular Medicine, Boston Children's Hospital, CLS_03, 3 Blackfan Circle, Boston, MA, 02115, USA
- Department of Pediatrics, Harvard Medical School, 25 Shattuck Street, Boston, MA, 02115, USA
- Charles E Larsen
- Program in Cellular and Molecular Medicine, Boston Children's Hospital, CLS_03, 3 Blackfan Circle, Boston, MA, 02115, USA
- Department of Medicine, Harvard Medical School, 25 Shattuck Street, Boston, MA, 02115, USA
40
Berman J, Forche A. Haplotyping a Non-meiotic Diploid Fungal Pathogen Using Induced Aneuploidies and SNP/CGH Microarray Analysis. Methods Mol Biol 2017; 1551:131-146. [PMID: 28138844 PMCID: PMC5482211 DOI: 10.1007/978-1-4939-6750-6_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
The generation of haplotype information has recently become very attractive due to its utility for identifying mutations associated with human disease and for the development of personalized medicine. Haplotype information also is crucial for studying recombination mechanisms and genetic diversity, and for analyzing allele-specific gene expression. Classic haplotyping methods require the analysis of hundreds of meiotic progeny. To facilitate haplotyping in the non-meiotic human fungal pathogen Candida albicans, we exploited trisomic heterozygous chromosomes generated via the UAU1 selection strategy. Using this system, we obtained phasing information from allelic biases, detected by SNP/CGH microarray analysis. This strategy has the potential to be applicable to other diploid, asexual Candida species that are important causes of human disease.
Affiliation(s)
- Judith Berman
- Department of Molecular Microbiology & Biotechnology, Tel Aviv University, Ramat Aviv, Israel
- Anja Forche
- Department of Biology, Bowdoin College, Brunswick, ME, USA
41
|
Abstract
Phase information of an individual genome provides fundamentally useful genetic information for the understanding of genome function, phenotype, and disease. With the development of new sequencing technology, much interest has been focused on the challenges in obtaining long-range phase information. Here, we present the detailed protocol for a method capable of generating genomic sequences completely phased across the entire chromosome through FACS-mediated chromosome sorting and next generation sequencing, known as Phase-seq.
Affiliation(s)
- Xi Chen
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Bio-X Program, Stanford University, Stanford, CA, 94305, USA
- Hong Yang
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Bio-X Program, Stanford University, Stanford, CA, 94305, USA
- Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Bio-X Program, Stanford University, Stanford, CA, 94305, USA
- Department of Health Research and Policy, Stanford University, Stanford, CA, 94305, USA
42
|
Edge P, Bafna V, Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res 2016; 27:801-812. [PMID: 27940952 PMCID: PMC5411775 DOI: 10.1101/gr.213462.116] [Citation(s) in RCA: 205] [Impact Index Per Article: 25.6] [Received: 08/02/2016] [Accepted: 12/08/2016] [Indexed: 11/24/2022]
Abstract
Many tools have been developed for haplotype assembly—the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types—dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing—we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.
Affiliation(s)
- Peter Edge
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA
- Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA
- Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California 92053, USA
43
|
Porubský D, Sanders AD, van Wietmarschen N, Falconer E, Hills M, Spierings DCJ, Bevova MR, Guryev V, Lansdorp PM. Direct chromosome-length haplotyping by single-cell sequencing. Genome Res 2016; 26:1565-1574. [PMID: 27646535 PMCID: PMC5088598 DOI: 10.1101/gr.209841.116] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 05/29/2016] [Accepted: 09/15/2016] [Indexed: 11/25/2022]
Abstract
Haplotypes are fundamental to fully characterize the diploid genome of an individual, yet methods to directly chart the unique genetic makeup of each parental chromosome are lacking. Here we introduce single-cell DNA template strand sequencing (Strand-seq) as a novel approach to phasing diploid genomes along the entire length of all chromosomes. We demonstrate this by building a complete haplotype for a HapMap individual (NA12878) at high accuracy (concordance 99.3%), without using generational information or statistical inference. By use of this approach, we mapped all meiotic recombination events in a family trio with high resolution (median range ∼14 kb) and phased larger structural variants like deletions, indels, and balanced rearrangements like inversions. Lastly, the single-cell resolution of Strand-seq allowed us to observe loss of heterozygosity regions in a small number of cells, a significant advantage for studies of heterogeneous cell populations, such as cancer cells. We conclude that Strand-seq is a unique and powerful approach to completely phase individual genomes and map inheritance patterns in families, while preserving haplotype differences between single cells.
Affiliation(s)
- David Porubský
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, 9713 AV Groningen, The Netherlands
- Ashley D Sanders
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, BC V5Z 1L3, Canada
- Niek van Wietmarschen
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, 9713 AV Groningen, The Netherlands
- Ester Falconer
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, BC V5Z 1L3, Canada
- Mark Hills
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, BC V5Z 1L3, Canada
- Diana C J Spierings
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, 9713 AV Groningen, The Netherlands
- Marianna R Bevova
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, 9713 AV Groningen, The Netherlands
- Victor Guryev
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, 9713 AV Groningen, The Netherlands
- Peter M Lansdorp
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, 9713 AV Groningen, The Netherlands
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, BC V5Z 1L3, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
44
|
Bracciali A, Aldinucci M, Patterson M, Marschall T, Pisanti N, Merelli I, Torquati M. PWHATSHAP: efficient haplotyping for future generation sequencing. BMC Bioinformatics 2016; 17:342. [PMID: 28185544 PMCID: PMC5046197 DOI: 10.1186/s12859-016-1170-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/11/2022] Open
Abstract
Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have exponential computational complexity. WhatsHap is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology’s current trends that are producing longer fragments. Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity, exploring a number of possible solutions that is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WhatsHap, which increases with coverage. Conclusions: Due to its structure and management of the large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
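The optimisation framing in this abstract (choose haplotypes that minimise error correction costs) can be made concrete with a toy example. The sketch below is not the WhatsHap or pWhatsHap algorithm, which the abstract describes as scaling with coverage; it is a brute-force enumeration under simplifying assumptions: a diploid sample, every site heterozygous (so the second haplotype is the complement of the first), and a unit cost per corrected allele. All function and variable names are hypothetical.

```python
from itertools import product


def mec_cost(fragments, haplotype):
    """Unit-cost error-correction score of one candidate haplotype.

    fragments: list of dicts mapping site index -> observed allele (0 or 1).
    haplotype: tuple of 0/1 alleles; the second haplotype is assumed to be
    its complement (every site heterozygous).
    """
    complement = tuple(1 - a for a in haplotype)
    total = 0
    for frag in fragments:
        # Count the alleles that would need correcting under each assignment.
        d0 = sum(1 for site, allele in frag.items() if haplotype[site] != allele)
        d1 = sum(1 for site, allele in frag.items() if complement[site] != allele)
        # Each fragment is assigned to whichever haplotype it fits better.
        total += min(d0, d1)
    return total


def brute_force_phase(fragments, n_sites):
    """Enumerate all haplotypes; exponential in n_sites, for illustration only."""
    best = None
    # Fix the first allele to 0 so the mirror-image solution is not revisited.
    for rest in product((0, 1), repeat=n_sites - 1):
        hap = (0,) + rest
        cost = mec_cost(fragments, hap)
        if best is None or cost < best[0]:
            best = (cost, hap)
    return best


# Five toy fragments over three sites; the last one conflicts with the others.
fragments = [
    {0: 0, 1: 1},
    {1: 1, 2: 0},
    {0: 1, 1: 0},
    {1: 0, 2: 1},
    {0: 0, 2: 1},
]
best_cost, best_hap = brute_force_phase(fragments, 3)
print(best_cost, best_hap)
```

The enumeration over `2^(n_sites - 1)` candidates shows why exhaustive search is impractical for real data and why practical tools restructure the search.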
Affiliation(s)
- Andrea Bracciali
- Computer Science and Mathematics, School of Natural Sciences, Stirling University, Stirling, FK9 4LA, UK
- Marco Aldinucci
- Department of Computer Science, University of Torino, Torino, Italy
- Murray Patterson
- Laboratoire de Biométrie et Biologie Evolutive, University Claude Bernard, Lyon, France
- Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland, Germany
- Max Planck Institute for Informatics, Saarbrücken, Germany
- Nadia Pisanti
- Department of Computer Science, University of Pisa, Pisa, Italy
- Erable Team, INRIA, Grenoble, France
- Ivan Merelli
- Institute of Biomedical Technologies, National Research Council, Milan, Italy
45
|
Li C, Cao C, Tu J, Sun X. An accurate clone-based haplotyping method by overlapping pool sequencing. Nucleic Acids Res 2016; 44:e112. [PMID: 27095193 PMCID: PMC4937318 DOI: 10.1093/nar/gkw284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2015] [Accepted: 04/07/2016] [Indexed: 11/25/2022] Open
Abstract
Chromosome-long haplotyping of human genomes is important to identify genetic variants with differing gene expression, in human evolution studies, clinical diagnosis, and other biological and medical fields. Although several methods have realized haplotyping based on sequencing technologies or population statistics, accuracy and cost are factors that prohibit their wide use. Borrowing ideas from group testing theories, we proposed a clone-based haplotyping method by overlapping pool sequencing. The clones from a single individual were pooled combinatorially and then sequenced. According to the distinct pooling pattern for each clone in the overlapping pool sequencing, alleles for the recovered variants could be assigned to their original clones precisely. Subsequently, the clone sequences could be reconstructed by linking these alleles accordingly and assembling them into haplotypes with high accuracy. To verify the utility of our method, we constructed 130 110 clones in silico for the individual NA12878 and simulated the pooling and sequencing process. Ultimately, 99.9% of variants on chromosome 1 that were covered by clones from both parental chromosomes were recovered correctly, and 112 haplotype contigs were assembled with an N50 length of 3.4 Mb and no switch errors. A comparison with current clone-based haplotyping methods indicated our method was more accurate.
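The idea of assigning alleles to clones by their pooling pattern can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: it assumes each clone is placed in its own unique subset of pools (its signature) and that an allele carried by a clone is observed in exactly those pools, with no sequencing dropout and no second clone contributing the same allele. The function names, pool counts, and clone labels are all hypothetical.

```python
from itertools import combinations


def assign_pool_signatures(clone_ids, n_pools, pools_per_clone):
    """Give each clone a distinct subset of pools (its signature)."""
    signatures = combinations(range(n_pools), pools_per_clone)
    design = {}
    for clone in clone_ids:
        try:
            design[clone] = frozenset(next(signatures))
        except StopIteration:
            raise ValueError("not enough distinct pool subsets for all clones") from None
    return design


def decode_allele(observed_pools, design):
    """Return the clone whose signature equals the pools where an allele was seen.

    Returns None when no signature matches exactly (for example, noisy data).
    """
    lookup = {signature: clone for clone, signature in design.items()}
    return lookup.get(frozenset(observed_pools))


# Toy design: 4 clones spread over 4 pools, 2 pools per clone.
design = assign_pool_signatures(["c1", "c2", "c3", "c4"], n_pools=4, pools_per_clone=2)
print(design)
print(decode_allele([2, 0], design))  # allele seen in pools 0 and 2
print(decode_allele([2, 3], design))  # no clone was placed in exactly pools 2 and 3
```

Exact matching keeps the sketch short; a realistic decoder would have to tolerate missing or spurious pool observations and several clones covering the same site.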
Affiliation(s)
- Cheng Li
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
- Changchang Cao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
- Jing Tu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
- Xiao Sun
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
46
|
Schütz E, Wehrhahn C, Wanjek M, Bortfeld R, Wemheuer WE, Beck J, Brenig B. The Holstein Friesian Lethal Haplotype 5 (HH5) Results from a Complete Deletion of TBF1M and Cholesterol Deficiency (CDH) from an ERV-(LTR) Insertion into the Coding Region of APOB. PLoS One 2016; 11:e0154602. [PMID: 27128314 PMCID: PMC4851415 DOI: 10.1371/journal.pone.0154602] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 01/25/2016] [Accepted: 04/17/2016] [Indexed: 12/17/2022] Open
Abstract
Background: With the availability of massive SNP data for several economically important cattle breeds, haplotype tests have been performed to identify unknown recessive disorders. A number of so-called lethal haplotypes have been uncovered in Holstein Friesian cattle and, for at least seven of these, the causative mutations have been identified in candidate genes. However, several lethal haplotypes still remain elusive. Here we report the molecular genetic causes of lethal haplotype 5 (HH5) and cholesterol deficiency (CDH). A targeted enrichment for the known genomic regions, followed by massive parallel sequencing, was used to interrogate for causative mutations in a case/control approach. Methods: Targeted enrichment for the known genomic regions, followed by massive parallel sequencing, was used in a case/control approach. PCRs for the causing mutations were developed and compared to routine imputing in 2,100 (HH5) and 3,100 (CDH) cattle. Results: HH5 is caused by a deletion of 138kbp, spanning position 93,233kb to 93,371kb on chromosome 9 (BTA9), harboring only dimethyl-adenosine transferase 1 (TFB1M). The deletion breakpoints are flanked by bovine long interspersed nuclear elements Bov-B (upstream) and L1ME3 (downstream), suggesting a homologous recombination/deletion event. TFB1M di-methylates adenine residues in the hairpin loop at the 3’-end of mitochondrial 12S rRNA, being essential for synthesis and function of the small ribosomal subunit of mitochondria. Homozygous TFB1M-/- mice reportedly exhibit embryonal lethality with developmental defects. A 2.8% allelic frequency was determined for the German HF population. CDH results from a 1.3kbp insertion of an endogenous retrovirus (ERV2-1-LTR_BT) into exon 5 of the APOB gene at BTA11:77,959kb. The insertion is flanked by 6bp target site duplications as described for insertions mediated by retroviral integrases. A premature stop codon in the open reading frame of APOB is generated, resulting in a truncation of the protein to a length of fewer than 140 amino acids. Such early truncations have been shown to cause an inability of chylomicron excretion from intestinal cells, resulting in malabsorption of cholesterol. The allelic frequency of this mutation in the German HF population was 6.7%, which is substantially higher than reported so far. Compared to PCR assays inferring the genetic variants directly, the routine imputing used so far showed a diagnostic sensitivity of as low as 91% (HH5) and 88% (CDH), with a high specificity for both (≥99.7%). Conclusion: With the availability of direct genetic tests it will now be possible to more effectively reduce the carrier frequency and ultimately eliminate the disorders from the HF populations. Besides this, the fact that repetitive genomic elements (RE) are involved in both diseases underlines the evolutionary importance of RE, which can be detrimental as here, but also advantageous over generations.
Affiliation(s)
- Ekkehard Schütz
- Institute of Veterinary Medicine, Georg-August-University Göttingen, Göttingen, Germany
- Chronix Biomedical GmbH, Göttingen, Germany
- Christin Wehrhahn
- Institute of Veterinary Medicine, Georg-August-University Göttingen, Göttingen, Germany
- Marius Wanjek
- Institute for Livestock Reproduction GmbH, Schönow, Germany
- Ralf Bortfeld
- Institute for Livestock Reproduction GmbH, Schönow, Germany
- Wilhelm E. Wemheuer
- Institute of Veterinary Medicine, Georg-August-University Göttingen, Göttingen, Germany
- Julia Beck
- Chronix Biomedical GmbH, Göttingen, Germany
- Bertram Brenig
- Institute of Veterinary Medicine, Georg-August-University Göttingen, Göttingen, Germany
47
48
Shyr C, Kushniruk A, van Karnebeek CDM, Wasserman WW. Dynamic software design for clinical exome and genome analyses: insights from bioinformaticians, clinical geneticists, and genetic counselors. J Am Med Inform Assoc 2016; 23:257-68. [PMID: 26117142 PMCID: PMC4784553 DOI: 10.1093/jamia/ocv053] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 01/15/2015] [Revised: 04/03/2015] [Accepted: 04/22/2015] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: The transition of whole-exome and whole-genome sequencing (WES/WGS) from the research setting to routine clinical practice remains challenging. OBJECTIVES: With almost no previous research specifically assessing interface designs and functionalities of WES and WGS software tools, the authors set out to ascertain perspectives from healthcare professionals in distinct domains on optimal clinical genomics user interfaces. METHODS: A series of semi-scripted focus groups, structured around professional challenges encountered in clinical WES and WGS, were conducted with bioinformaticians (n = 8), clinical geneticists (n = 9), genetic counselors (n = 5), and general physicians (n = 4). RESULTS: Contrary to popular existing system designs, bioinformaticians preferred command line over graphical user interfaces for better software compatibility and customization flexibility. Clinical geneticists and genetic counselors desired an overarching interactive graphical layout to prioritize candidate variants: a "tiered" system where only functionalities relevant to the user domain are made accessible. They favored a system capable of retrieving consistent representations of external genetic information from third-party sources. To streamline collaboration and patient exchanges, the authors identified user requirements toward an automated reporting system capable of summarizing key evidence-based clinical findings among the vast array of technical details. CONCLUSIONS: Successful adoption of a clinical WES/WGS system is heavily dependent on its ability to address the diverse necessities and predilections among specialists in distinct healthcare domains. Tailored software interfaces suitable for each group are likely more appropriate than the current popular "one size fits all" generic framework. This study provides interfaces for future intervention studies and software engineering opportunities.
Affiliation(s)
- Casper Shyr
- Centre for Molecular Medicine and Therapeutics; Child and Family Research Institute, Vancouver BC, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver BC, Canada
- Treatable Intellectual Disability Endeavour in British Columbia (www.tidebc.org), Vancouver, Canada
- Andre Kushniruk
- School of Health Information Science, University of Victoria, 3800 Finnerty Rd, Victoria, BC V8P 5C2, Canada
- Clara D M van Karnebeek
- Treatable Intellectual Disability Endeavour in British Columbia (www.tidebc.org), Vancouver, Canada
- Division of Biochemical Diseases, BC Children's Hospital, Vancouver BC, Canada
- Department of Pediatrics, University of British Columbia, Vancouver BC, Canada
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics; Child and Family Research Institute, Vancouver BC, Canada
- Treatable Intellectual Disability Endeavour in British Columbia (www.tidebc.org), Vancouver, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
49
Rhee JK, Li H, Joung JG, Hwang KB, Zhang BT, Shin SY. Survey of computational haplotype determination methods for single individual. Genes Genomics 2015. [DOI: 10.1007/s13258-015-0342-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 10/22/2022]