1
|
Sun S, Cheng F, Han D, Wei S, Zhong A, Massoudian S, Johnson AB. Pairwise comparative analysis of six haplotype assembly methods based on users' experience. BMC Genom Data 2023; 24:35. [PMID: 37386408 PMCID: PMC10311811 DOI: 10.1186/s12863-023-01134-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2022] [Accepted: 05/25/2023] [Indexed: 07/01/2023] Open
Abstract
BACKGROUND A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is a process of obtaining haplotypes using DNA sequencing data. Currently, there are many HA methods with their own strengths and weaknesses. This study focused on comparing six HA methods or algorithms: HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30). Their outputs were then compared. RESULT Run time (CPU time) was compared to assess the efficiency of 6 HA methods. HapCUT2 was the fastest HA for 6 datasets, with run time consistently under 2 min. In addition, WhatsHap was relatively fast, and its run time was 21 min or less for all 6 datasets. The other 4 HA algorithms' run time varied across different datasets and coverage levels. To assess their accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and Single Nucleotide Variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match with the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they had relatively similar performance. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with other methods. However, for the hg38 data, WhatsHap had similar performance as the other 4 algorithms, except SDhaP. The comparison analysis showed that SDhaP had a much larger disagreement rate when it was compared with the other algorithms in all 6 datasets. CONCLUSION The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
Collapse
Affiliation(s)
- Shuying Sun
- Department of Mathematics, Texas State University, San Marcos, TX USA
| | - Flora Cheng
- Carnegie Mellon University, Pittsburgh, PA USA
| | - Daphne Han
- Carnegie Mellon University, Pittsburgh, PA USA
| | - Sarah Wei
- Massachusetts Institute of Technology, Cambridge, MA USA
| | | | | | | |
Collapse
|
2
|
Paes J, Silva GAV, Tarragô AM, Mourão LPDS. The Contribution of JAK2 46/1 Haplotype in the Predisposition to Myeloproliferative Neoplasms. Int J Mol Sci 2022; 23:12582. [PMID: 36293440 PMCID: PMC9604447 DOI: 10.3390/ijms232012582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Revised: 10/13/2022] [Accepted: 10/15/2022] [Indexed: 11/17/2022] Open
Abstract
Haplotype 46/1 (GGCC) consists of a set of genetic variations distributed along chromosome 9p.24.1, which extend from the Janus Kinase 2 gene to Insulin like 4. Marked by four jointly inherited variants (rs3780367, rs10974944, rs12343867, and rs1159782), this haplotype has a strong association with the development of BCR-ABL1-negative myeloproliferative neoplasms (MPNs) because it precedes the acquisition of the JAK2V617F variant, a common genetic alteration in individuals with these hematological malignancies. It is also described as one of the factors that increases the risk of familial MPNs by more than five times, 46/1 is associated with events related to inflammatory dysregulation, splenomegaly, splanchnic vein thrombosis, Budd-Chiari syndrome, increases in RBC count, platelets, leukocytes, hematocrit, and hemoglobin, which are characteristic of MPNs, as well as other findings that are still being elucidated and which are of great interest for the etiopathological understanding of these hematological neoplasms. Considering these factors, the present review aims to describe the main findings and discussions involving the 46/1 haplotype, and highlights the molecular and immunological aspects and their relevance as a tool for clinical practice and investigation of familial cases.
Collapse
Affiliation(s)
- Jhemerson Paes
- Programa de Pós-Graduação em Ciências Aplicadas à Hematologia, Universidade do Estado do Amazonas (UEA), Manaus 69850-000, AM, Brazil
| | - George A. V. Silva
- Programa de Pós-Graduação em Ciências Aplicadas à Hematologia, Universidade do Estado do Amazonas (UEA), Manaus 69850-000, AM, Brazil
- Fundação Hospitalar de Hematologia e Hemoterapia do Amazonas (FHEMOAM), Manaus 69050-001, AM, Brazil
- Fundação Oswaldo Cruz–Instituto Leônidas e Maria Deane (Fiocruz), Manaus 69027-070, AM, Brazil
| | - Andréa M. Tarragô
- Programa de Pós-Graduação em Ciências Aplicadas à Hematologia, Universidade do Estado do Amazonas (UEA), Manaus 69850-000, AM, Brazil
- Fundação Hospitalar de Hematologia e Hemoterapia do Amazonas (FHEMOAM), Manaus 69050-001, AM, Brazil
| | - Lucivana P. de Souza Mourão
- Programa de Pós-Graduação em Ciências Aplicadas à Hematologia, Universidade do Estado do Amazonas (UEA), Manaus 69850-000, AM, Brazil
- Fundação Hospitalar de Hematologia e Hemoterapia do Amazonas (FHEMOAM), Manaus 69050-001, AM, Brazil
| |
Collapse
|
3
|
Ura H, Togi S, Niida Y. Targeted Double-Stranded cDNA Sequencing-Based Phase Analysis to Identify Compound Heterozygous Mutations and Differential Allelic Expression. BIOLOGY 2021; 10:biology10040256. [PMID: 33804940 PMCID: PMC8063809 DOI: 10.3390/biology10040256] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/05/2021] [Revised: 03/22/2021] [Accepted: 03/22/2021] [Indexed: 11/16/2022]
Abstract
Simple Summary Phase analysis to distinguish between in cis and in trans heterozygous mutations is important for clinical diagnosis because in trans compound heterozygous mutations cause autosomal recessive diseases. However, conventional phase analysis is limited because of the large target size of genomic DNA. Here, we performed a targeted double-stranded cDNA sequencing-based phase analysis to resolve the limitation of distance using direct adapter ligation library preparation and paired-end sequencing; we elucidated that two heterozygous mutations on a patient with Wilson disease are in trans compound heterozygous mutations. Furthermore, we detected the differential allelic expression. Our results indicate that a targeted double-stranded cDNA sequencing-based phase analysis is useful for determining compound heterozygous mutations and confers information on allelic expression. Abstract There are two combinations of heterozygous mutation, i.e., in trans, which carries mutations on different alleles, and in cis, which carries mutations on the same allele. Because only in trans compound heterozygous mutations have been implicated in autosomal recessive diseases, it is important to distinguish them for clinical diagnosis. However, conventional phase analysis is limited because of the large target size of genomic DNA. Here, we performed a genetic analysis on a patient with Wilson disease, and we detected two heterozygous mutations chr13:51958362;G>GG (NM_000053.4:c.2304dup r.2304dup p.Met769HisfsTer26) and chr13:51964900;C>T (NM_000053.4:c.1841G>A r.1841g>a p.Gly614Asp) in the causative gene ATP7B. The distance between the two mutations was 6.5 kb in genomic DNA but 464 bp in mRNA. Targeted double-stranded cDNA sequencing-based phase analysis was performed using direct adapter ligation library preparation and paired-end sequencing, and we elucidated they are in trans compound heterozygous mutations. Trio analysis showed that the mutation (chr13:51964900;C>T) derived from the father and the other mutation from the mother, validating that the mutations are in trans composition. Furthermore, targeted double-stranded cDNA sequencing-based phase analysis detected the differential allelic expression, suggesting that the mutation (chr13:51958362;G>GG) caused downregulation of expression by nonsense-mediated mRNA decay. Our results indicate that targeted double-stranded cDNA sequencing-based phase analysis is useful for determining compound heterozygous mutations and confers information on allelic expression.
Collapse
Affiliation(s)
- Hiroki Ura
- Center for Clinical Genomics, Kanazawa Medical University Hospital, 1-1 Daigaku, Uchinada, Kahoku, Ishikawa 920-0923, Japan; (S.T.); (Y.N.)
- Division of Genomic Medicine, Department of Advanced Medicine, Medical Research Institute, Kanazawa Medical University, 1-1 Daigaku, Uchinada, Kahoku, Ishikawa 920-0923, Japan
- Correspondence: ; Tel.: +81-076-286-2211 (ext. 8353)
| | - Sumihito Togi
- Center for Clinical Genomics, Kanazawa Medical University Hospital, 1-1 Daigaku, Uchinada, Kahoku, Ishikawa 920-0923, Japan; (S.T.); (Y.N.)
- Division of Genomic Medicine, Department of Advanced Medicine, Medical Research Institute, Kanazawa Medical University, 1-1 Daigaku, Uchinada, Kahoku, Ishikawa 920-0923, Japan
| | - Yo Niida
- Center for Clinical Genomics, Kanazawa Medical University Hospital, 1-1 Daigaku, Uchinada, Kahoku, Ishikawa 920-0923, Japan; (S.T.); (Y.N.)
- Division of Genomic Medicine, Department of Advanced Medicine, Medical Research Institute, Kanazawa Medical University, 1-1 Daigaku, Uchinada, Kahoku, Ishikawa 920-0923, Japan
| |
Collapse
|
4
|
Mocci E, Debeljak M, Klein AP, Eshleman JR. A New Fast Phasing Method Based On Haplotype Subtraction. J Mol Diagn 2019; 21:427-436. [PMID: 30872187 PMCID: PMC6504677 DOI: 10.1016/j.jmoldx.2018.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2018] [Revised: 10/26/2018] [Accepted: 12/31/2018] [Indexed: 11/16/2022] Open
Abstract
We developed a novel phasing approach, based solely on molecules and genotype frequency, that does not rely on inference of new alleles. We initiated the project because of errors that were detected in the phased 1000 Genomes Project data. The algorithm first combined identical genotypes into clusters and ranked them by descending frequency. Using alleles defined in homozygotes, it combined them to produce expected genotypes that were dismissed and subtracted them from remaining genotypes to define additional new putative alleles. Putative alleles had to be confirmed by identifying them in independent genotypes, and the process was iterated until all alleles were identified. The new approach was validated using single-molecule sequencing of eight loci, 145 (8 to 35 per locus) alleles were identified, and an average 98.2% (range, 95.0% to 99.9%) of 1000 genome individuals at these loci were explained. The accuracy of the new method was compared with that from PHASE and SHAPEIT2 to the experimentally determined genotypes based on single-molecule sequencing. Our method was comparable to PHASE and SHAPEIT2 in accuracy but was, on average, 14.6- and 10.8-fold faster, respectively.
Collapse
Affiliation(s)
- Evelina Mocci
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
| | - Marija Debeljak
- Department of Pathology, Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Alison P Klein
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center; Department of Pathology, Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - James R Eshleman
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center; Department of Pathology, Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins School of Medicine, Baltimore, Maryland.
| |
Collapse
|
5
|
Tian S, Yan H, Klee EW, Kalmbach M, Slager SL. Comparative analysis of de novo assemblers for variation discovery in personal genomes. Brief Bioinform 2019; 19:893-904. [PMID: 28407084 PMCID: PMC6169673 DOI: 10.1093/bib/bbx037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2016] [Accepted: 03/08/2017] [Indexed: 12/30/2022] Open
Abstract
Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations and provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly and whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM together with three haplotype- or local de novo assembly-based callers. SGVar outperforms the other assemblers in sensitivity and tolerance of sequencing errors. We recapitulated the findings on whole-genome and exome data from a Utah residents with Northern and Western European ancestry (CEU) trio, showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar is capable of resolving long-range phase and identifying large INDELs from WES, more prominently from WGS. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.
Collapse
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Eric W Klee
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.,Center for Individualized Medicine Bioinformatics Program, Mayo Clinic, USA
| | - Michael Kalmbach
- Division of Information Management and Analytics, Department of Information Technology, Mayo Clinic, USA
| | - Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| |
Collapse
|
6
|
Abstract
Background Inference of haplotypes, or the sequence of alleles along the same chromosomes, is a fundamental problem in genetics and is a key component for many analyses including admixture mapping, identifying regions of identity by descent and imputation. Haplotype phasing based on sequencing reads has attracted lots of attentions. Diploid haplotype phasing where the two haplotypes are complimentary have been studied extensively. In this work, we focused on Polyploid haplotype phasing where we aim to phase more than two haplotypes at the same time from sequencing data. The problem is much more complicated as the search space becomes much larger and the haplotypes do not need to be complimentary any more. Results We proposed two algorithms, (1) Poly-Harsh, a Gibbs Sampling based algorithm which alternatively samples haplotypes and the read assignments to minimize the mismatches between the reads and the phased haplotypes, (2) An efficient algorithm to concatenate haplotype blocks into contiguous haplotypes. Conclusions Our experiments showed that our method is able to improve the quality of the phased haplotypes over the state-of-the-art methods. To our knowledge, our algorithm for haplotype blocks concatenation is the first algorithm that leverages the shared information across multiple individuals to construct contiguous haplotypes. Our experiments showed that it is both efficient and effective.
Collapse
Affiliation(s)
- Dan He
- College of Computer Science and Software, Shenzhen University, Shenzhen, 518060, China.
| | - Subrata Saha
- IBM T.J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, 10598, NY, USA
| | - Richard Finkers
- Wageningen University & Research, 6708 PB, Wageningen, Netherlands
| | - Laxmi Parida
- IBM T.J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, 10598, NY, USA
| |
Collapse
|
7
|
Abstract
BACKGROUND Haplotype assembly is the task of reconstructing haplotypes of an individual from a mixture of sequenced chromosome fragments. Haplotype information enables studies of the effects of genetic variations on an organism's phenotype. Most of the mathematical formulations of haplotype assembly are known to be NP-hard and haplotype assembly becomes even more challenging as the sequencing technology advances and the length of the paired-end reads and inserts increases. Assembly of haplotypes polyploid organisms is considerably more difficult than in the case of diploids. Hence, scalable and accurate schemes with provable performance are desired for haplotype assembly of both diploid and polyploid organisms. RESULTS We propose a framework that formulates haplotype assembly from sequencing data as a sparse tensor decomposition. We cast the problem as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, U and [Formula: see text]; tensor [Formula: see text] reveals haplotype information while U is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by iteratively solving this decomposition problem is proposed. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. The developed framework is applicable to diploid, biallelic and polyallelic polyploid species. The code for AltHap is freely available from https://github.com/realabolfazl/AltHap . CONCLUSION AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and significantly outperforms existing techniques when applied to haplotype assembly of polyploids.
Collapse
Affiliation(s)
- Abolfazl Hashemi
- Department of ECE, University of Texas at Austin, Austin, Texas, USA
| | - Banghua Zhu
- EE Department, Tsinghua University, Beijing, China
| | - Haris Vikalo
- Department of ECE, University of Texas at Austin, Austin, Texas, USA
| |
Collapse
|
8
|
Castel SE, Mohammadi P, Chung WK, Shen Y, Lappalainen T. Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nat Commun 2016; 7:12817. [PMID: 27605262 PMCID: PMC5025529 DOI: 10.1038/ncomms12817] [Citation(s) in RCA: 76] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2016] [Accepted: 08/03/2016] [Indexed: 11/09/2022] Open
Abstract
Haplotype phasing of genetic variants is important for clinical interpretation of the genome, population genetic analysis and functional genomic analysis of allelic activity. Here we present phASER, an accurate approach for phasing variants that are overlapped by sequencing reads, including those from RNA sequencing (RNA-seq), which often span multiple exons due to splicing. Using diverse RNA-seq data we demonstrate that this provides more accurate phasing of rare variants compared with population-based phasing and allows phasing of variants in the same gene up to hundreds of kilobases away that cannot be obtained from DNA sequencing (DNA-seq) reads. We show that in the context of medical genetic studies this improves the resolution of compound heterozygotes. Additionally, phASER provides measures of haplotypic expression that increase power and accuracy in studies of allelic expression. In summary, phasing using RNA-seq and phASER is accurate and improves studies where rare variant haplotypes or allelic expression is needed.
Collapse
Affiliation(s)
- Stephane E. Castel
- New York Genome Center, New York, NY, 10013, USA
- Department of Systems Biology, Columbia University, New York, NY, 10032, USA
| | - Pejman Mohammadi
- New York Genome Center, New York, NY, 10013, USA
- Department of Systems Biology, Columbia University, New York, NY, 10032, USA
| | - Wendy K. Chung
- Departments of Pediatrics and Medicine, Columbia University, New York, NY, 10032, USA
| | - Yufeng Shen
- Department of Systems Biology, Columbia University, New York, NY, 10032, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
| | - Tuuli Lappalainen
- New York Genome Center, New York, NY, 10013, USA
- Department of Systems Biology, Columbia University, New York, NY, 10032, USA
| |
Collapse
|
9
|
Xie M, Wang J, Chen X. LGH: A Fast and Accurate Algorithm for Single Individual Haplotyping Based on a Two-Locus Linkage Graph. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1255-1266. [PMID: 26671798 DOI: 10.1109/tcbb.2015.2430352] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Phased haplotype information is crucial in our complete understanding of differences between individuals at the genetic level. Given a collection of DNA fragments sequenced from a homologous pair of chromosomes, the problem of single individual haplotyping (SIH) aims to reconstruct a pair of haplotypes using a computer algorithm. In this paper, we encode the information of aligned DNA fragments into a two-locus linkage graph and approach the SIH problem by vertex labeling of the graph. In order to find a vertex labeling with the minimum sum of weights of incompatible edges, we develop a fast and accurate heuristic algorithm. It starts with detecting error-tolerant components by an adapted breadth-first search. A proper labeling of vertices is then identified for each component, with which sequencing errors are further corrected and edge weights are adjusted accordingly. After contracting each error-tolerant component into a single vertex, the above procedure is iterated on the resulting condensed linkage graph until error-tolerant components are no longer detected. The algorithm finally outputs a haplotype pair based on the vertex labeling. Extensive experiments on simulated and real data show that our algorithm is more accurate and faster than five existing algorithms for single individual haplotyping.
Collapse
|
10
|
Rhee JK, Li H, Joung JG, Hwang KB, Zhang BT, Shin SY. Survey of computational haplotype determination methods for single individual. Genes Genomics 2015. [DOI: 10.1007/s13258-015-0342-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
11
|
Wang Y, Liang B, Tong X, Marder K, Bressman S, Orr-Urtreger A, Giladi N, Zeng D. Efficient Estimation of Nonparametric Genetic Risk Function with Censored Data. Biometrika 2015; 102:515-532. [PMID: 26412864 PMCID: PMC4581539 DOI: 10.1093/biomet/asv030] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
With an increasing number of causal genes discovered for complex human disorders, it is crucial to assess the genetic risk of disease onset for individuals who are carriers of these causal mutations and compare the distribution of age-at-onset with that in non-carriers. In many genetic epidemiological studies aiming at estimating causal gene effect on disease, the age-at-onset of disease is subject to censoring. In addition, some individuals' mutation carrier or non-carrier status can be unknown due to the high cost of in-person ascertainment to collect DNA samples or death in older individuals. Instead, the probability of these individuals' mutation status can be obtained from various sources. When mutation status is missing, the available data take the form of censored mixture data. Recently, various methods have been proposed for risk estimation from such data, but none is efficient for estimating a nonparametric distribution. We propose a fully efficient sieve maximum likelihood estimation method, in which we estimate the logarithm of the hazard ratio between genetic mutation groups using B-splines, while applying nonparametric maximum likelihood estimation for the reference baseline hazard function. Our estimator can be calculated via an expectation-maximization algorithm which is much faster than existing methods. We show that our estimator is consistent and semiparametrically efficient and establish its asymptotic distribution. Simulation studies demonstrate superior performance of the proposed method, which is applied to the estimation of the distribution of the age-at-onset of Parkinson's disease for carriers of mutations in the leucine-rich repeat kinase 2 gene.
Collapse
Affiliation(s)
- Yuanjia Wang
- Department of Biostatistics, Mailman School of Public Health, 722 W168th Street, New York 10032, U.S.A.
| | - Baosheng Liang
- School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China.
| | - Xingwei Tong
- School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China.
| | - Karen Marder
- Department of Neurology and Psychiatry, College of Physicians and Surgeons, Columbia University, New York 10032, U.S.A.
| | - Susan Bressman
- The Alan and Barbara Mirken Department of Neurology, Beth Israel Medical Center, New York, 10003, U.S.A.
| | - Avi Orr-Urtreger
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
| | - Nir Giladi
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
| | - Donglin Zeng
- Department of Biostatistics, CB # 7420, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7420, U.S.A.
| |
Collapse
|
12
|
Abstract
Human genomes are diploid and, for their complete description and interpretation, it is necessary not only to discover the variation they contain but also to arrange it onto chromosomal haplotypes. Although whole-genome sequencing is becoming increasingly routine, nearly all such individual genomes are mostly unresolved with respect to haplotype, particularly for rare alleles, which remain poorly resolved by inferential methods. Here, we review emerging technologies for experimentally resolving (that is, 'phasing') haplotypes across individual whole-genome sequences. We also discuss computational methods relevant to their implementation, metrics for assessing their accuracy and completeness, and the relevance of haplotype information to applications of genome sequencing in research and clinical medicine.
Collapse
|
13
|
Patterson M, Marschall T, Pisanti N, van Iersel L, Stougie L, Klau GW, Schönhuth A. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J Comput Biol 2015; 22:498-509. [PMID: 25658651 DOI: 10.1089/cmb.2014.0157] [Citation(s) in RCA: 211] [Impact Index Per Article: 23.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers.
Collapse
Affiliation(s)
- Murray Patterson
- 1Laboratoire de Biométrie et Biologie Évolutive (LBBE : UMR CNRS 5558), Université de Lyon 1, Villeurbanne, France
| | - Tobias Marschall
- 2Center for Bioinformatics, Saarland University, Saarbrücken, Germany.,3Max Planck Institute for Informatics, Saarbrücken, Germany
| | - Nadia Pisanti
- 4Department of Computer Science, University of Pisa, Italy.,7Erable Team, INRIA
| | | | - Leen Stougie
- 6VU University, Amsterdam, The Netherlands.,7Erable Team, INRIA
| | - Gunnar W Klau
- 6VU University, Amsterdam, The Netherlands.,7Erable Team, INRIA
| | | |
Collapse
|
14
|
Abstract
Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.
Collapse
Affiliation(s)
- Gustavo Glusman
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| | - Hannah C Cox
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| | - Jared C Roach
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| |
Collapse
|
15
|
Matsumoto H, Kiryu H. Integrating dilution-based sequencing and population genotypes for single individual haplotyping. BMC Genomics 2014; 15:733. [PMID: 25167975 PMCID: PMC4162929 DOI: 10.1186/1471-2164-15-733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2013] [Accepted: 08/18/2014] [Indexed: 11/30/2022] Open
Abstract
Background Haplotype information is useful for many genetic analyses and haplotypes are usually inferred using computational approaches. Among such approaches, the importance of single individual haplotyping (SIH), which infers individual haplotypes from sequence fragments, has been increasing with the advent of novel sequencing techniques, such as dilution-based sequencing. These techniques could produce virtual long read fragments by separating DNA fragments into multiple low-concentration aliquots, sequencing and mapping each aliquot, and merging clustered short reads. Although these experimental techniques are sophisticated, they have the problem of producing chimeric fragments whose left and right parts match different chromosomes. In our previous research, we found that chimeric fragments significantly decrease the accuracy of SIH. Although chimeric fragments can be removed by using haplotypes which are determined from pedigree genotypes, pedigree genotypes are generally not available. The length of reads cluster and heterozygous calls were also used to detect chimeric fragments. Although some chimeric fragments will be removed with these features, considerable number of chimeric fragments will be undetected because of the dispersion of the length and the absence of SNPs in the overlapped regions. For these reasons, a general method to detect and remove chimeric fragments is needed. Results In this paper, we propose a general method to detect chimeric fragments. The basis of our method is that a chimeric fragment would correspond to an artificial recombinant haplotype and would differ from biological haplotypes. To detect differences from biological haplotypes, we integrated statistical phasing, which is a haplotype inference approach from population genotypes, into our method. We applied our method to two datasets and detected chimeric fragments with high AUC. AUC values of our method are higher than those of just using cluster length and heterozygous calls. We then used multiple SIH algorithm to compare the accuracy of SIH before and after removing the chimeric fragment candidates. The accuracy of assembled haplotypes increased significantly after removing chimeric fragment candidates. Conclusions Our method is useful for detecting chimeric fragments and improving SIH accuracy. The Ruby script is available at
https://sites.google.com/site/hmatsu1226/software/csp. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-733) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Hirotaka Matsumoto
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-8561, Japan.
| | | |
Collapse
|
16
|
Mangul S, Wu NC, Mancuso N, Zelikovsky A, Sun R, Eskin E. Accurate viral population assembly from ultra-deep sequencing data. Bioinformatics 2014; 30:i329-37. [PMID: 24932001 PMCID: PMC4058922 DOI: 10.1093/bioinformatics/btu295] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. RESULTS In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation-maximization algorithm to estimate abundances of the assembled viral variants in the population. RESULTS on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads. AVAILABILITY Our tool VGA is freely available at http://genetics.cs.ucla.edu/vga/
Collapse
Affiliation(s)
- Serghei Mangul
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| | - Nicholas C Wu
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| | - Nicholas Mancuso
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| | - Alex Zelikovsky
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| | - Ren Sun
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| | - Eleazar Eskin
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USAComputer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| |
Collapse
|
17
|
Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat Commun 2014; 5:3934. [PMID: 25653097 PMCID: PMC4338501 DOI: 10.1038/ncomms4934] [Citation(s) in RCA: 290] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2013] [Accepted: 04/23/2014] [Indexed: 12/15/2022] Open
Abstract
A major use of the 1000 Genomes Project (1000GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low coverage sequencing data that can take advantage of SNP microarray genotypes on the same samples. Firstly the SNP array data are phased in order to build a backbone (or ’scaffold’) of haplotypes across each chromosome. We then phase the sequence data ‘onto’ this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and biallelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low frequency variants.
Collapse
|
18
|
Fujimoto M, Bodily PM, Okuda N, Clement MJ, Snell Q. Effects of error-correction of heterozygous next-generation sequencing data. BMC Bioinformatics 2014; 15 Suppl 7:S3. [PMID: 25077414 PMCID: PMC4110727 DOI: 10.1186/1471-2105-15-s7-s3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022] Open
Abstract
Background Error correction is an important step in increasing the quality of next-generation sequencing data for downstream analysis and use. Polymorphic datasets are a challenge for many bioinformatic software packages that are designed for or assume homozygosity of an input dataset. This assumption ignores the true genomic composition of many organisms that are diploid or polyploid. In this survey, two different error correction packages, Quake and ECHO, are examined to see how they perform on next-generation sequence data from heterozygous genomes. Results Quake and ECHO perform well and were able to correct many errors found within the data. However, errors that occur at heterozygous positions had unique trends. Errors at these positions were sometimes corrected incorrectly, introducing errors into the dataset with the possibility of creating a chimeric read. Quake was much less likely to create chimeric reads. Quake's read trimming removed a large portion of the original data and often left reads with few heterozygous markers. ECHO resulted in more chimeric reads and introduced more errors than Quake but preserved heterozygous markers. Using real E. coli sequencing data and their assemblies after error correction, the assembly statistics improved. It was also found that segregating reads by haplotype can improve the quality of an assembly. Conclusions These findings suggest that Quake and ECHO both have strengths and weaknesses when applied to heterozygous data. With the increased interest in haplotype specific analysis, new tools that are designed to be haplotype-aware are necessary that do not have the weaknesses of Quake and ECHO.
Collapse
|
19
|
Patterson M, Marschall T, Pisanti N, van Iersel L, Stougie L, Klau GW, Schönhuth A. WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. LECTURE NOTES IN COMPUTER SCIENCE 2014. [DOI: 10.1007/978-3-319-05269-4_19] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
|