1
|
Sun S, Cheng F, Han D, Wei S, Zhong A, Massoudian S, Johnson AB. Pairwise comparative analysis of six haplotype assembly methods based on users' experience. BMC Genom Data 2023; 24:35. [PMID: 37386408 PMCID: PMC10311811 DOI: 10.1186/s12863-023-01134-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2022] [Accepted: 05/25/2023] [Indexed: 07/01/2023] Open
Abstract
BACKGROUND A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is a process of obtaining haplotypes using DNA sequencing data. Currently, there are many HA methods with their own strengths and weaknesses. This study focused on comparing six HA methods or algorithms: HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30). Their outputs were then compared. RESULT Run time (CPU time) was compared to assess the efficiency of 6 HA methods. HapCUT2 was the fastest HA for 6 datasets, with run time consistently under 2 min. In addition, WhatsHap was relatively fast, and its run time was 21 min or less for all 6 datasets. The other 4 HA algorithms' run time varied across different datasets and coverage levels. To assess their accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and Single Nucleotide Variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match with the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they had relatively similar performance. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with other methods. However, for the hg38 data, WhatsHap had similar performance as the other 4 algorithms, except SDhaP. The comparison analysis showed that SDhaP had a much larger disagreement rate when it was compared with the other algorithms in all 6 datasets. CONCLUSION The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
Collapse
Affiliation(s)
- Shuying Sun
- Department of Mathematics, Texas State University, San Marcos, TX USA
| | - Flora Cheng
- Carnegie Mellon University, Pittsburgh, PA USA
| | - Daphne Han
- Carnegie Mellon University, Pittsburgh, PA USA
| | - Sarah Wei
- Massachusetts Institute of Technology, Cambridge, MA USA
| | | | | | | |
Collapse
|
2
|
Hu Y, Yang C, Zhang L, Zhou X. Haplotyping-Assisted Diploid Assembly and Variant Detection with Linked Reads. Methods Mol Biol 2023; 2590:161-182. [PMID: 36335499 DOI: 10.1007/978-1-0716-2819-5_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
Phasing is essential for determining the origins of each set of alleles in the whole-genome sequencing data of individuals. As such, it provides essential information for the causes of hereditary diseases and the sources of individual variability. Recent technical breakthroughs in linked-read (referred to as co-barcoding in other chapters of the book) and long-read sequencing and downstream analysis have brought the goal of accurate and complete phasing within reach. Here we review recent progress related to the assembly and phasing of personal genomes based on linked-reads and related applications. Motivated by current limitations in generating high-quality diploid assemblies and detecting variants, a new suite of software tools, Aquila, was developed to fully take advantage of linked-read sequencing technology. The overarching goal of Aquila is to exploit the strengths of linked-read technology including long-range connectivity and inherent phasing of variants for reference-assisted local de novo assembly at the whole-genome scale. The diploid nature of the assemblies facilitates detection and phasing of genetic variation, including single nucleotide variations (SNVs), small insertions and deletions (indels), and structural variants (SVs). An extension of Aquila, Aquila_stLFR, focuses on another newly developed linked-reads sequencing technology, single-tube long-fragment read (stLFR). AquilaSV, a region-based diploid assembly approach, is used to characterize structural variants and can achieve diploid assembly in one target region at a time. Lastly, we introduce HAPDeNovo, a program that exploits phasing information from linked-read sequencing to improve detection of de novo mutations. Use of these tools is expected to harness the advantages of linked-reads technology, improve phasing, and advance variant discovery.
Collapse
Affiliation(s)
- Yunfei Hu
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
| | - Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong.
| | - Xin Zhou
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA.
- Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA.
- Data Science Institute, Nashville, TN, USA.
| |
Collapse
|
3
|
Huang J, Pallotti S, Zhou Q, Kleber M, Xin X, King DA, Napolioni V. PERHAPS: Paired-End short Reads-based HAPlotyping from next-generation Sequencing data. Brief Bioinform 2020; 22:6025504. [PMID: 33285565 DOI: 10.1093/bib/bbaa320] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2020] [Revised: 10/11/2020] [Accepted: 10/19/2020] [Indexed: 11/13/2022] Open
Abstract
The identification of rare haplotypes may greatly expand our knowledge in the genetic architecture of both complex and monogenic traits. To this aim, we developed PERHAPS (Paired-End short Reads-based HAPlotyping from next-generation Sequencing data), a new and simple approach to directly call haplotypes from short-read, paired-end Next Generation Sequencing (NGS) data. To benchmark this method, we considered the APOE classic polymorphism (*1/*2/*3/*4), since it represents one of the best examples of functional polymorphism arising from the haplotype combination of two Single Nucleotide Polymorphisms (SNPs). We leveraged the big Whole Exome Sequencing (WES) and SNP-array data obtained from the multi-ethnic UK BioBank (UKBB, N=48,855). By applying PERHAPS, based on piecing together the paired-end reads according to their FASTQ-labels, we extracted the haplotype data, along with their frequencies and the individual diplotype. Concordance rates between WES directly called diplotypes and the ones generated through statistical pre-phasing and imputation of SNP-array data are extremely high (>99%), either when stratifying the sample by SNP-array genotyping batch or self-reported ethnic group. Hardy-Weinberg Equilibrium tests and the comparison of obtained haplotype frequencies with the ones available from the 1000 Genome Project further supported the reliability of PERHAPS. Notably, we were able to determine the existence of the rare APOE*1 haplotype in two unrelated African subjects from UKBB, supporting its presence at appreciable frequency (approximatively 0.5%) in the African Yoruba population. Despite acknowledging some technical shortcomings, PERHAPS represents a novel and simple approach that will partly overcome the limitations in direct haplotype calling from short read-based sequencing.
Collapse
Affiliation(s)
- Jie Huang
- Department of Global Health, School of Public Health, Peking University, Beijing, China
| | - Stefano Pallotti
- Genetics and Animal Breeding Group, School of Pharmacy, University of Camerino, Italy
| | - Qianling Zhou
- Department of Maternal and Child Health, School of Public Health, Peking University, Beijing, China
| | - Marcus Kleber
- Department of Medicine, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany and at SYNLAB MVZ Humangenetik Mannheim, Mannheim, Germany
| | | | - Daniel A King
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Valerio Napolioni
- School of Biosciences and Veterinary Medicine, University of Camerino, Camerino, Italy
| |
Collapse
|
4
|
Zhang X, Wu R, Wang Y, Yu J, Tang H. Unzipping haplotypes in diploid and polyploid genomes. Comput Struct Biotechnol J 2019; 18:66-72. [PMID: 31908732 PMCID: PMC6938933 DOI: 10.1016/j.csbj.2019.11.011] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2019] [Revised: 11/25/2019] [Accepted: 11/26/2019] [Indexed: 11/18/2022] Open
Abstract
Diploid genomes consist of two homologous copies of chromosomes with one from each parent while polyploid genomes contain more than two homologous sets of chromosomes. Most of the reference genome assemblies collapsed haplotypes that represent 'mosaic' sequences, ignoring allelic variants that may be involved in important cellular and biological functions. Unzipping haplotypes into distinct sets of sequences has been a growing trend in recent genome studies, as it is an essential tool towards resolving important clinical and biological questions, such as compound heterozygotes, heterosis, and evolution. Herein, we review existing methods for alignment-based and assembly-based haplotype phasing for heterozygous diploid and polyploid genomes, as well as recent advances of experimental approaches for improved genome phasing. We anticipate that full haplotype phasing could become a routine procedure in genome studies in the near future.
Collapse
Affiliation(s)
- Xingtan Zhang
- Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Corps, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou 350002, China
| | - Ruoxi Wu
- Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Corps, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou 350002, China
| | - Yibin Wang
- Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Corps, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou 350002, China
| | - Jiaxin Yu
- Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Corps, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou 350002, China
| | - Haibao Tang
- Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Corps, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Corresponding author.
| |
Collapse
|
5
|
Geng Y, Zhao Z, Liu J. [Reconstruction of tumor clonal haplotypes based on an improved spanning algorithm]. NAN FANG YI KE DA XUE XUE BAO = JOURNAL OF SOUTHERN MEDICAL UNIVERSITY 2019; 39:1287-1292. [PMID: 31852653 DOI: 10.12122/j.issn.1673-4254.2019.11.04] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
OBJECTIVE To reconstruct tumor clonal haplotypes based on the third-generation sequencing data to effectively identify tumor heterogeneity. METHODS We developed an algorithm for extracting somatic mutational event from the mixed tumor data and determining the connection weight of each somatic cell mutation site through the probability function. A reconstruction algorithm of the haplotype was designed based on the maximum spanning tree, and following the principle of inheritance between tumor clones, the connection pattern was determined at each mutation site in the clonal maximum spanning tree in a stepwise manner. The number, ratio and evolution of the sub-clones were estimated using the depth stripping method. RESULTS In the simulation experiments, we analyzed the accuracy of the algorithm based on 4 indexes, namely the coverage, read length, subclone number and somatic variant rate, and the Results demonstrated a good robustness of the algorithm. The Results of the experiments showed that the mean sub-clone haplotypes accuracy exceeded 97%, suggesting that this algorithm significantly outperformed the previous Methods. CONCLUSIONS The proposed method can accurately reconstruct tumor subclonal haplotypes and clarify the process of tumor clonal evolution, and can thus provide a theoretical basis for tumor heterogeneity research and assist in clinical decision-making.
Collapse
Affiliation(s)
- Yu Geng
- School of Health Management, Jinzhou Medical University, Jinzhou 121001, China.,School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
| | - Zhongmeng Zhao
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
| | - Jianye Liu
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
| |
Collapse
|
6
|
Edge P, Bafna V, Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res 2016; 27:801-812. [PMID: 27940952 PMCID: PMC5411775 DOI: 10.1101/gr.213462.116] [Citation(s) in RCA: 199] [Impact Index Per Article: 24.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2016] [Accepted: 12/08/2016] [Indexed: 11/24/2022]
Abstract
Many tools have been developed for haplotype assembly—the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types—dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing—we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.
Collapse
Affiliation(s)
- Peter Edge
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA
| | - Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California 92053, USA
| |
Collapse
|
7
|
Chen ZZ, Deng F, Shen C, Wang Y, Wang L. Better ILP-Based Approaches to Haplotype Assembly. J Comput Biol 2016; 23:537-52. [DOI: 10.1089/cmb.2015.0035] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Affiliation(s)
- Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama, Japan
| | - Fei Deng
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
| | - Chao Shen
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
| | - Yiji Wang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
| |
Collapse
|
8
|
Xie M, Wang J, Chen X. LGH: A Fast and Accurate Algorithm for Single Individual Haplotyping Based on a Two-Locus Linkage Graph. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1255-1266. [PMID: 26671798 DOI: 10.1109/tcbb.2015.2430352] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Phased haplotype information is crucial in our complete understanding of differences between individuals at the genetic level. Given a collection of DNA fragments sequenced from a homologous pair of chromosomes, the problem of single individual haplotyping (SIH) aims to reconstruct a pair of haplotypes using a computer algorithm. In this paper, we encode the information of aligned DNA fragments into a two-locus linkage graph and approach the SIH problem by vertex labeling of the graph. In order to find a vertex labeling with the minimum sum of weights of incompatible edges, we develop a fast and accurate heuristic algorithm. It starts with detecting error-tolerant components by an adapted breadth-first search. A proper labeling of vertices is then identified for each component, with which sequencing errors are further corrected and edge weights are adjusted accordingly. After contracting each error-tolerant component into a single vertex, the above procedure is iterated on the resulting condensed linkage graph until error-tolerant components are no longer detected. The algorithm finally outputs a haplotype pair based on the vertex labeling. Extensive experiments on simulated and real data show that our algorithm is more accurate and faster than five existing algorithms for single individual haplotyping.
Collapse
|
9
|
Rhee JK, Li H, Joung JG, Hwang KB, Zhang BT, Shin SY. Survey of computational haplotype determination methods for single individual. Genes Genomics 2015. [DOI: 10.1007/s13258-015-0342-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
10
|
Ahn S, Vikalo H. Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm. BMC Bioinformatics 2015; 16:223. [PMID: 26178880 PMCID: PMC4503296 DOI: 10.1186/s12859-015-0651-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2014] [Accepted: 06/26/2015] [Indexed: 01/01/2023] Open
Abstract
Background Genetic variations predispose individuals to hereditary diseases, play important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate data that is erroneous, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affect the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data as well as simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed. Conclusions The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes. A C code implementation of ParticleHap will be available for download from https://sites.google.com/site/asynoeun/particlehap.
Collapse
Affiliation(s)
- Soyeon Ahn
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, 78712, Texas, USA.
| | - Haris Vikalo
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, 78712, Texas, USA.
| |
Collapse
|
11
|
Abstract
Human genomes are diploid and, for their complete description and interpretation, it is necessary not only to discover the variation they contain but also to arrange it onto chromosomal haplotypes. Although whole-genome sequencing is becoming increasingly routine, nearly all such individual genomes are mostly unresolved with respect to haplotype, particularly for rare alleles, which remain poorly resolved by inferential methods. Here, we review emerging technologies for experimentally resolving (that is, 'phasing') haplotypes across individual whole-genome sequences. We also discuss computational methods relevant to their implementation, metrics for assessing their accuracy and completeness, and the relevance of haplotype information to applications of genome sequencing in research and clinical medicine.
Collapse
|
12
|
Abstract
Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.
Collapse
Affiliation(s)
- Gustavo Glusman
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| | - Hannah C Cox
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| | - Jared C Roach
- Institute for Systems Biology, Terry Avenue North, Seattle, WA 98109 USA
| |
Collapse
|
13
|
Abstract
MOTIVATION Accurate haplotyping-determining from which parent particular portions of the genome are inherited-is still mostly an unresolved problem in genomics. This problem has only recently started to become tractable, thanks to the development of new long read sequencing technologies. Here, we introduce ProbHap, a haplotyping algorithm targeted at such technologies. The main algorithmic idea of ProbHap is a new dynamic programming algorithm that exactly optimizes a likelihood function specified by a probabilistic graphical model and which generalizes a popular objective called the minimum error correction. In addition to being accurate, ProbHap also provides confidence scores at phased positions. RESULTS On a standard benchmark dataset, ProbHap makes 11% fewer errors than current state-of-the-art methods. This accuracy can be further increased by excluding low-confidence positions, at the cost of a small drop in haplotype completeness. AVAILABILITY Our source code is freely available at: https://github.com/kuleshov/ProbHap.
Collapse
Affiliation(s)
- Volodymyr Kuleshov
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
14
|
Matsumoto H, Kiryu H. Integrating dilution-based sequencing and population genotypes for single individual haplotyping. BMC Genomics 2014; 15:733. [PMID: 25167975 PMCID: PMC4162929 DOI: 10.1186/1471-2164-15-733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2013] [Accepted: 08/18/2014] [Indexed: 11/30/2022] Open
Abstract
Background Haplotype information is useful for many genetic analyses and haplotypes are usually inferred using computational approaches. Among such approaches, the importance of single individual haplotyping (SIH), which infers individual haplotypes from sequence fragments, has been increasing with the advent of novel sequencing techniques, such as dilution-based sequencing. These techniques could produce virtual long read fragments by separating DNA fragments into multiple low-concentration aliquots, sequencing and mapping each aliquot, and merging clustered short reads. Although these experimental techniques are sophisticated, they have the problem of producing chimeric fragments whose left and right parts match different chromosomes. In our previous research, we found that chimeric fragments significantly decrease the accuracy of SIH. Although chimeric fragments can be removed by using haplotypes which are determined from pedigree genotypes, pedigree genotypes are generally not available. The length of reads cluster and heterozygous calls were also used to detect chimeric fragments. Although some chimeric fragments will be removed with these features, considerable number of chimeric fragments will be undetected because of the dispersion of the length and the absence of SNPs in the overlapped regions. For these reasons, a general method to detect and remove chimeric fragments is needed. Results In this paper, we propose a general method to detect chimeric fragments. The basis of our method is that a chimeric fragment would correspond to an artificial recombinant haplotype and would differ from biological haplotypes. To detect differences from biological haplotypes, we integrated statistical phasing, which is a haplotype inference approach from population genotypes, into our method. We applied our method to two datasets and detected chimeric fragments with high AUC. AUC values of our method are higher than those of just using cluster length and heterozygous calls. We then used multiple SIH algorithm to compare the accuracy of SIH before and after removing the chimeric fragment candidates. The accuracy of assembled haplotypes increased significantly after removing chimeric fragment candidates. Conclusions Our method is useful for detecting chimeric fragments and improving SIH accuracy. The Ruby script is available at
https://sites.google.com/site/hmatsu1226/software/csp. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-733) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Hirotaka Matsumoto
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-8561, Japan.
| | | |
Collapse
|
15
|
Tretyakov K, Goldberg T, Jin VX, Horton P. Summary of talks and papers at ISCB-Asia/SCCG 2012. BMC Genomics 2013. [PMCID: PMC3639071 DOI: 10.1186/1471-2164-14-s2-i1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Abstract
The second ISCB-Asia conference of the International Society for Computational Biology took place December 17-19, 2012, in Shenzhen, China. The conference was co-hosted by BGI as the first Shenzhen Conference on Computational Genomics (SCCG).
45 talks were presented at ISCB-Asia/SCCG 2012. The topics covered included software tools, reproducible computing, next-generation sequencing data analysis, transcription and mRNA regulation, protein structure and function, cancer genomics and personalized medicine. Nine of the proceedings track talks are included as full papers in this supplement.
In this report we first give a short overview of the conference by listing some statistics and visualizing the talk abstracts as word clouds. Then we group the talks by topic and briefly summarize each one, providing references to related publications whenever possible. Finally, we close with a few comments on the success of this conference.
Collapse
|