1
Sankararaman A, Vikalo H, Baccelli F. ComHapDet: a spatial community detection algorithm for haplotype assembly. BMC Genomics 2020; 21:586. [PMID: 32900369 PMCID: PMC7488034 DOI: 10.1186/s12864-020-06935-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing the lengths of sequencing reads and insert sizes helps improve the accuracy of reconstruction, it also exacerbates the computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. RESULTS We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes which allows both biallelic and multi-allelic variants. CONCLUSIONS Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
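The second step of the two-step procedure described in this abstract (assembling haplotypes once each read has a community label) can be sketched as a per-site majority vote. This is only an illustrative sketch, not the ComHapDet implementation: the function name `assemble_haplotypes`, the dictionary-based read representation, and the toy labels are all assumptions made for the example, and the labels are taken as given rather than recovered by community detection.

```python
from collections import Counter


def assemble_haplotypes(reads, labels, num_haplotypes, num_sites, missing="-"):
    """Majority-vote consensus per (haplotype, site).

    reads  : list of dicts mapping a variant-site index to the observed allele
             (hypothetical representation chosen for this sketch)
    labels : list of ints, the estimated community (haplotype) label per read
    """
    # One vote counter for every haplotype and every variant site.
    votes = [[Counter() for _ in range(num_sites)] for _ in range(num_haplotypes)]
    for read, label in zip(reads, labels):
        for site, allele in read.items():
            votes[label][site][allele] += 1

    # The most frequent allele wins; sites no read covers stay marked as missing.
    return [
        [counts.most_common(1)[0][0] if counts else missing for counts in hap_votes]
        for hap_votes in votes
    ]


# Toy diploid, biallelic example: five reads over three variant sites.
reads = [{0: 0, 1: 1}, {1: 1, 2: 0}, {0: 1, 1: 0}, {1: 0, 2: 1}, {0: 0, 2: 0}]
labels = [0, 0, 1, 1, 0]  # assumed output of the label-recovery step
haplotypes = assemble_haplotypes(reads, labels, num_haplotypes=2, num_sites=3)
print(haplotypes)
```

Because alleles are counted rather than assumed binary, the same vote would also handle multi-allelic sites and more than two haplotypes, provided the labels are available.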
Affiliation(s)
- Abishek Sankararaman
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Haris Vikalo
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- François Baccelli
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA; Department of Mathematics, The University of Texas at Austin, Austin, TX, USA
2
Igarashi K, Funakoshi M, Kato S, Moriwaki T, Kato Y, Zhang-Akiyama QM. CiApex1 has AP endonuclease activity and abrogated AP site repair disrupts early embryonic development in Ciona intestinalis. Genes Genet Syst 2019; 94:81-93. [PMID: 30930342 DOI: 10.1266/ggs.18-00043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
Abstract
Apurinic/apyrimidinic (AP) sites are the most common form of cytotoxic DNA damage. Since AP sites inhibit DNA replication and transcription, repairing them is critical for cell growth. However, the significance of repairing AP sites during early embryonic development has not yet been clearly determined. Here, we focused on APEX1 from the ascidian Ciona intestinalis (CiApex1), a homolog of human AP endonuclease 1 (APEX1), and examined its role in early embryonic development. Recombinant CiApex1 protein complemented the drug sensitivities of an AP endonuclease-deficient Escherichia coli mutant, and exhibited Mg2+-dependent AP endonuclease activity, like human APEX1, in vitro. Next, the effects of abnormal AP site repair on embryonic development were investigated. Treatment with methyl methanesulfonate, which alkylates DNA bases and generates AP sites, induced abnormal embryonic development. This abnormal phenotype was also caused by treatment with methoxyamine, which inhibits AP endonuclease activity. Furthermore, we constructed dominant-negative CiApex1, which inhibits CiApex1 action, and found that its expression impaired embryonic growth. These results suggested that AP site repair is essential for embryonic development and CiApex1 plays an important role in AP site repair during early embryonic development in C. intestinalis.
Affiliation(s)
- Kento Igarashi
  - Laboratory of Stress Response Biology, Department of Biological Sciences, Graduate School of Science, Kyoto University; Department of Applied Pharmacology, Graduate School of Medical and Dental Sciences, Kagoshima University
- Masafumi Funakoshi
  - Laboratory of Stress Response Biology, Department of Biological Sciences, Graduate School of Science, Kyoto University
- Seiji Kato
  - Laboratory of Stress Response Biology, Department of Biological Sciences, Graduate School of Science, Kyoto University
- Takahito Moriwaki
  - Laboratory of Stress Response Biology, Department of Biological Sciences, Graduate School of Science, Kyoto University; Department of Stem Cell Biology, Atomic Bomb Disease Institute, Nagasaki University
- Yuichi Kato
  - Laboratory of Stress Response Biology, Department of Biological Sciences, Graduate School of Science, Kyoto University; Engineering Biology Research Center, Kobe University
- Qiu-Mei Zhang-Akiyama
  - Laboratory of Stress Response Biology, Department of Biological Sciences, Graduate School of Science, Kyoto University
3
Abstract
BACKGROUND Haplotype assembly is the task of reconstructing haplotypes of an individual from a mixture of sequenced chromosome fragments. Haplotype information enables studies of the effects of genetic variations on an organism's phenotype. Most of the mathematical formulations of haplotype assembly are known to be NP-hard, and haplotype assembly becomes even more challenging as the sequencing technology advances and the length of the paired-end reads and inserts increases. Assembly of haplotypes of polyploid organisms is considerably more difficult than in the case of diploids. Hence, scalable and accurate schemes with provable performance are desired for haplotype assembly of both diploid and polyploid organisms. RESULTS We propose a framework that formulates haplotype assembly from sequencing data as a sparse tensor decomposition. We cast the problem as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, U and [Formula: see text]; tensor [Formula: see text] reveals haplotype information while U is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by iteratively solving this decomposition problem, is proposed. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. The developed framework is applicable to diploid, biallelic and polyallelic polyploid species. The code for AltHap is freely available from https://github.com/realabolfazl/AltHap. CONCLUSION AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and to significantly outperform existing techniques when applied to haplotype assembly of polyploids.
Affiliation(s)
- Abolfazl Hashemi
  - Department of ECE, University of Texas at Austin, Austin, Texas, USA
- Banghua Zhu
  - EE Department, Tsinghua University, Beijing, China
- Haris Vikalo
  - Department of ECE, University of Texas at Austin, Austin, Texas, USA
4
Abstract
Cardiac cell specification and the genetic determinants that govern this process are highly conserved among Chordates. Recent studies have established the importance of evolutionarily-conserved mechanisms in the study of congenital heart defects and disease, as well as cardiac regeneration. As a basal Chordate, the Ciona model system presents a simple scaffold that recapitulates the basic blueprint of cardiac development in Chordates. Here we will focus on the development and cellular structure of the heart of the ascidian Ciona as compared to other Chordates, principally vertebrates. Comparison of the Ciona model system to heart development in other Chordates presents great potential for dissecting the genetic mechanisms that underlie congenital heart defects and disease at the cellular level and might provide additional insight into potential pathways for therapeutic cardiac regeneration.
5
Zhao X, Emery SB, Myers B, Kidd JM, Mills RE. Resolving complex structural genomic rearrangements using a randomized approach. Genome Biol 2016; 17:126. [PMID: 27287201 PMCID: PMC4901421 DOI: 10.1186/s13059-016-0993-1] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 04/20/2016] [Accepted: 05/25/2016] [Indexed: 12/27/2022] Open
Abstract
Complex chromosomal rearrangements are structural genomic alterations involving multiple instances of deletions, duplications, inversions, or translocations that co-occur either on the same chromosome or represent different overlapping events on homologous chromosomes. We present SVelter, an algorithm that identifies regions of the genome suspected to harbor a complex event and then resolves the structure by iteratively rearranging the local genome structure, in a randomized fashion, with each structure scored against characteristics of the observed sequencing data. SVelter is able to accurately reconstruct complex chromosomal rearrangements when compared to well-characterized genomes that have been deeply sequenced with both short and long reads.
Affiliation(s)
- Xuefang Zhao
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Sarah B Emery
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
- Bridget Myers
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
- Jeffrey M Kidd
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA; Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
- Ryan E Mills
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA; Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
6
Xing Q, Yu Q, Dou H, Wang J, Li R, Ning X, Wang R, Wang S, Zhang L, Hu X, Bao Z. Genome-wide identification, characterization and expression analyses of two TNFRs in Yesso scallop (Patinopecten yessoensis) provide insight into the disparity of responses to bacterial infections and heat stress in bivalves. Fish & Shellfish Immunology 2016; 52:44-56. [PMID: 26988286 DOI: 10.1016/j.fsi.2016.03.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/14/2015] [Revised: 01/28/2016] [Accepted: 03/10/2016] [Indexed: 05/16/2023]
Abstract
Tumor necrosis factor receptors (TNFRs) comprise a superfamily of proteins characterized by a unique cysteine-rich domain (CRD) and play important roles in diverse physiological and pathological processes in the innate immune system, including inflammation, apoptosis, autoimmunity and organogenesis. Although significant effects of TNFRs on immunity have been reported in most vertebrates as well as some invertebrates, the complete TNFR superfamily has not been systematically characterized in scallops. In this study, two different types of TNFR-like genes, PyTNFR1 and PyTNFR2, were identified from Yesso scallop (Patinopecten yessoensis, Jay, 1857) through whole-genome scanning. Phylogenetic and protein structural analyses were carried out to determine the identities and evolutionary relationships of the two genes. The expression profiling of PyTNFRs was performed at different developmental stages, in healthy adult tissues and in hemocytes after bacterial infection and heat stress. Expression analysis revealed that both PyTNFRs were significantly induced during the acute phase (3 h) after infection with Gram-positive (Micrococcus luteus) and Gram-negative (Vibrio anguillarum) bacteria, though much more dramatic chronic-phase (24 h) changes were observed after V. anguillarum challenge. For heat stress, only PyTNFR2 displayed significant elevation at 12 h and 24 h, which suggests a functional difference between the two PyTNFRs. Collectively, this study provides novel insight into the PyTNFRs and the specific role and response of TNFR-involved pathways in host immune responses against different bacterial pathogens and heat stress in bivalves.
Affiliation(s)
- Qiang Xing
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Qian Yu
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Huaiqian Dou
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Jing Wang
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Ruojiao Li
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Xianhui Ning
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Ruijia Wang
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Shi Wang
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Lingling Zhang
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Xiaoli Hu
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
- Zhenmin Bao
  - Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao 266003, China
7
Puljiz Z, Vikalo H. Decoding Genetic Variations: Communications-Inspired Haplotype Assembly. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:518-530. [PMID: 27295635 DOI: 10.1109/tcbb.2015.2462367] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 06/06/2023]
Abstract
High-throughput DNA sequencing technologies allow fast and affordable sequencing of individual genomes and thus enable unprecedented studies of genetic variations. Information about variations in the genome of an individual is provided by haplotypes, ordered collections of single nucleotide polymorphisms. Knowledge of haplotypes is instrumental in finding genes associated with diseases, drug development, and evolutionary studies. Haplotype assembly from high-throughput sequencing data is challenging due to errors and limited lengths of sequencing reads. The key observation made in this paper is that the minimum error-correction formulation of the haplotype assembly problem is identical to the task of deciphering a coded message received over a noisy channel, a classical problem in the mature field of communication theory. Exploiting this connection, we develop novel haplotype assembly schemes that rely on the bit-flipping and belief propagation algorithms often used in communication systems. The latter algorithm is then adapted to the haplotype assembly of polyploids. We demonstrate on both simulated and experimental data that the proposed algorithms compare favorably with state-of-the-art haplotype assembly methods in terms of accuracy, while being scalable and computationally efficient.
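The minimum error-correction (MEC) objective mentioned in this abstract can be illustrated with a short scoring function: each read is assigned to the candidate haplotype it disagrees with least, and the MEC score is the total number of remaining disagreements. This is a minimal sketch of the standard objective only, not the bit-flipping or belief propagation schemes of the paper; the name `mec_score` and the dictionary-based read representation are assumptions made for the example.

```python
def mec_score(reads, haplotypes):
    """Total number of allele corrections needed to make every read
    consistent with its closest candidate haplotype.

    reads      : list of dicts mapping a variant-site index to the observed allele
                 (hypothetical representation chosen for this sketch)
    haplotypes : list of allele sequences, one per candidate haplotype
    """
    total = 0
    for read in reads:
        # Mismatches of this read against each haplotype, counted only
        # over the sites the read actually covers.
        mismatches = (
            sum(1 for site, allele in read.items() if hap[site] != allele)
            for hap in haplotypes
        )
        total += min(mismatches)
    return total


# Toy diploid example: the last read conflicts with both haplotypes at one site.
reads = [{0: 0, 1: 1}, {1: 1, 2: 0}, {0: 1, 1: 0}, {1: 0, 2: 0}]
haplotypes = [[0, 1, 0], [1, 0, 1]]
score = mec_score(reads, haplotypes)
print(score)
```

In the decoding analogy, the haplotypes play the role of codewords and the reads are noisy partial observations of them; an assembler searches for the haplotypes that minimize this score.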
8
José-Edwards DS, Oda-Ishii I, Kugler JE, Passamaneck YJ, Katikala L, Nibu Y, Di Gregorio A. Brachyury, Foxa2 and the cis-Regulatory Origins of the Notochord. PLoS Genet 2015; 11:e1005730. [PMID: 26684323 PMCID: PMC4684326 DOI: 10.1371/journal.pgen.1005730] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 08/19/2015] [Accepted: 11/16/2015] [Indexed: 11/18/2022] Open
Abstract
A main challenge of modern biology is to understand how specific constellations of genes are activated to differentiate cells and give rise to distinct tissues. This study focuses on elucidating how gene expression is initiated in the notochord, an axial structure that provides support and patterning signals to embryos of humans and all other chordates. Although numerous notochord genes have been identified, the regulatory DNAs that orchestrate development and propel evolution of this structure by eliciting notochord gene expression remain mostly uncharted, and the information on their configuration and recurrence is still quite fragmentary. Here we used the simple chordate Ciona for a systematic analysis of notochord cis-regulatory modules (CRMs), and investigated their composition, architectural constraints, predictive ability and evolutionary conservation. We found that most Ciona notochord CRMs relied upon variable combinations of binding sites for the transcription factors Brachyury and/or Foxa2, which can act either synergistically or independently from one another. Notably, one of these CRMs contains a Brachyury binding site juxtaposed to an (AC) microsatellite, an unusual arrangement also found in Brachyury-bound regulatory regions in mouse. In contrast, different subsets of CRMs relied upon binding sites for transcription factors of widely diverse families. Surprisingly, we found that neither intra-genomic nor interspecific conservation of binding sites were reliably predictive hallmarks of notochord CRMs. We propose that rather than obeying a rigid sequence-based cis-regulatory code, most notochord CRMs are rather unique. Yet, this study uncovered essential elements recurrently used by divergent chordates as basic building blocks for notochord CRMs.
Affiliation(s)
- Diana S. José-Edwards
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Izumi Oda-Ishii
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Jamie E. Kugler
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Yale J. Passamaneck
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Lavanya Katikala
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Yutaka Nibu
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Anna Di Gregorio
  - Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
9
Ahn S, Vikalo H. Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm. BMC Bioinformatics 2015; 16:223. [PMID: 26178880 PMCID: PMC4503296 DOI: 10.1186/s12859-015-0651-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/25/2014] [Accepted: 06/26/2015] [Indexed: 01/01/2023] Open
Abstract
Background Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate data that is erroneous, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data and simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed.
Conclusions The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes. A C code implementation of ParticleHap will be available for download from https://sites.google.com/site/asynoeun/particlehap.
Affiliation(s)
- Soyeon Ahn
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, 78712, Texas, USA
- Haris Vikalo
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, 78712, Texas, USA
10
Abstract
Background Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly, owing to the violation of assumptions during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective solution to obtaining haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking. Methods We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning classification is used to then classify sequences in partial bubbles as homologous or non-homologous sequences prior to reconstructing haplotype-specific scaffolds. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups. Results We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads). Conclusion More work is required to comparatively analyze this approach on real data with various parameters and classifiers against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder supply validity to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.
11
Das S, Vikalo H. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 2015; 16:260. [PMID: 25885901 PMCID: PMC4422552 DOI: 10.1186/s12864-015-1408-5] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 09/17/2014] [Accepted: 02/27/2015] [Indexed: 11/30/2022] Open
Abstract
Background The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the lengths of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed. Results We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure – namely, the low rank of the underlying solution – to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap. Conclusion Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.
Affiliation(s)
- Shreepriya Das
  - Department of ECE, The University of Texas at Austin, Austin, USA
- Haris Vikalo
  - Department of ECE, The University of Texas at Austin, Austin, USA
12
Abstract
MOTIVATION Accurate haplotyping, that is, determining from which parent particular portions of the genome are inherited, is still mostly an unresolved problem in genomics. This problem has only recently started to become tractable, thanks to the development of new long read sequencing technologies. Here, we introduce ProbHap, a haplotyping algorithm targeted at such technologies. The main algorithmic idea of ProbHap is a new dynamic programming algorithm that exactly optimizes a likelihood function specified by a probabilistic graphical model and which generalizes a popular objective called the minimum error correction. In addition to being accurate, ProbHap also provides confidence scores at phased positions. RESULTS On a standard benchmark dataset, ProbHap makes 11% fewer errors than current state-of-the-art methods. This accuracy can be further increased by excluding low-confidence positions, at the cost of a small drop in haplotype completeness. AVAILABILITY Our source code is freely available at: https://github.com/kuleshov/ProbHap.
Affiliation(s)
- Volodymyr Kuleshov
  - Department of Computer Science, Stanford University, Stanford, CA 94305, USA
13
Matsumoto H, Kiryu H. Integrating dilution-based sequencing and population genotypes for single individual haplotyping. BMC Genomics 2014; 15:733. [PMID: 25167975 PMCID: PMC4162929 DOI: 10.1186/1471-2164-15-733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2013] [Accepted: 08/18/2014] [Indexed: 11/30/2022] Open
Abstract
Background Haplotype information is useful for many genetic analyses and haplotypes are usually inferred using computational approaches. Among such approaches, the importance of single individual haplotyping (SIH), which infers individual haplotypes from sequence fragments, has been increasing with the advent of novel sequencing techniques, such as dilution-based sequencing. These techniques can produce virtual long read fragments by separating DNA fragments into multiple low-concentration aliquots, sequencing and mapping each aliquot, and merging clustered short reads. Although these experimental techniques are sophisticated, they have the problem of producing chimeric fragments whose left and right parts match different chromosomes. In our previous research, we found that chimeric fragments significantly decrease the accuracy of SIH. Although chimeric fragments can be removed by using haplotypes which are determined from pedigree genotypes, pedigree genotypes are generally not available. The length of read clusters and heterozygous calls have also been used to detect chimeric fragments. Although some chimeric fragments will be removed with these features, a considerable number of chimeric fragments will go undetected because of the dispersion of the length and the absence of SNPs in the overlapped regions. For these reasons, a general method to detect and remove chimeric fragments is needed. Results In this paper, we propose a general method to detect chimeric fragments. The basis of our method is that a chimeric fragment would correspond to an artificial recombinant haplotype and would differ from biological haplotypes. To detect differences from biological haplotypes, we integrated statistical phasing, which is a haplotype inference approach from population genotypes, into our method. We applied our method to two datasets and detected chimeric fragments with high AUC. AUC values of our method are higher than those obtained using only cluster length and heterozygous calls. We then used multiple SIH algorithms to compare the accuracy of SIH before and after removing the chimeric fragment candidates. The accuracy of assembled haplotypes increased significantly after removing chimeric fragment candidates. Conclusions Our method is useful for detecting chimeric fragments and improving SIH accuracy. The Ruby script is available at https://sites.google.com/site/hmatsu1226/software/csp. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-733) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hirotaka Matsumoto
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-8561, Japan.
|
14
|
Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, Wilkie AOM, McVean G, Lunter G. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet 2014; 46:912-918. [PMID: 25017105 PMCID: PMC4753679 DOI: 10.1038/ng.3036] [Citation(s) in RCA: 709] [Impact Index Per Article: 70.9] [Received: 11/22/2013] [Accepted: 06/23/2014] [Indexed: 12/19/2022]
Abstract
High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.
Affiliation(s)
- Andy Rimmer
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Hang Phan
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Iain Mathieson
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Stephen R F Twigg
- Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, UK
- Andrew O M Wilkie
- Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, UK
- Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Department of Statistics, University of Oxford, Oxford, UK
- Gerton Lunter
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
|
15
|
Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, Itoh T. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res 2014; 24:1384-95. [PMID: 24755901 PMCID: PMC4120091 DOI: 10.1101/gr.170720.113] [Citation(s) in RCA: 743] [Impact Index Per Article: 74.3] [Indexed: 12/20/2022]
Abstract
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.
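To make the graph complexity mentioned in this abstract concrete, the following is a minimal sketch of a de Bruijn graph built from two toy haplotypes that differ at a single base. It is not Platanus code: the function names, the fixed k, and the sequences are all hypothetical, and a real assembler would also handle reverse complements, coverage, and sequencing errors.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Return nodes with more than one successor (where a bubble opens)."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical haplotypes differing at one position (A vs T at index 4).
hap_a = "ACGTACCTGA"
hap_b = "ACGTTCCTGA"

graph = build_de_bruijn([hap_a, hap_b], k=4)
branches = branching_nodes(graph)
print(branches)  # the single node where the two haplotype paths diverge
```

In this toy case the two paths split at one node and rejoin a few nodes later, forming the kind of "bubble" that heterozygosity introduces and that an assembler must simplify.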
Affiliation(s)
- Rei Kajitani
- Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
- Kouta Toshimoto
- Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan; AXIOHELIX Co. Ltd., Chuo-ku, Tokyo 103-0015, Japan
- Hideki Noguchi
- Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Atsushi Toyoda
- Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan; Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Yoshitoshi Ogura
- Division of Microbial Genomics, Frontier Science Research Center, University of Miyazaki, Miyazaki 889-1692, Japan; Division of Microbiology, Faculty of Medicine, University of Miyazaki, Miyazaki 889-1692, Japan
- Miki Okuno
- Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
- Mitsuru Yabana
- Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
- Masayuki Harada
- Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
- Eiji Nagayasu
- Division of Parasitology, Faculty of Medicine, University of Miyazaki, Miyazaki 889-1692, Japan
- Haruhiko Maruyama
- Division of Parasitology, Faculty of Medicine, University of Miyazaki, Miyazaki 889-1692, Japan
- Yuji Kohara
- Genetic Strains Research Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Asao Fujiyama
- Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan; Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Tetsuya Hayashi
- Division of Microbial Genomics, Frontier Science Research Center, University of Miyazaki, Miyazaki 889-1692, Japan; Division of Microbiology, Faculty of Medicine, University of Miyazaki, Miyazaki 889-1692, Japan
- Takehiko Itoh
- Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
|
16
|
Sequencing, assembling, and correcting draft genomes using recombinant populations. G3-GENES GENOMES GENETICS 2014; 4:669-79. [PMID: 24531727 PMCID: PMC4059239 DOI: 10.1534/g3.114.010264] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 11/26/2022]
Abstract
Current de novo whole-genome sequencing approaches often are inadequate for organisms lacking substantial preexisting genetic data. Problems with these methods are manifest as: large numbers of scaffolds that are not ordered within chromosomes or assigned to individual chromosomes, misassembly of allelic sequences as separate loci when the individual(s) being sequenced are heterozygous, and the collapse of recently duplicated sequences into a single locus, regardless of levels of heterozygosity. Here we propose a new approach for producing de novo whole-genome sequences—which we call recombinant population genome construction—that solves many of the problems encountered in standard genome assembly and that can be applied in model and nonmodel organisms. Our approach takes advantage of next-generation sequencing technologies to simultaneously barcode and sequence a large number of individuals from a recombinant population. The sequences of all recombinants can be combined to create an initial de novo assembly, followed by the use of individual recombinant genotypes to correct assembly splitting/collapsing and to order and orient scaffolds within linkage groups. Recombinant population genome construction can rapidly accelerate the transformation of nonmodel species into genome-enabled systems by simultaneously producing a high-quality genome assembly and providing genomic tools (e.g., high-confidence single-nucleotide polymorphisms) for immediate applications. In populations segregating for important functional traits, this approach also enables simultaneous mapping of quantitative trait loci. We demonstrate our method using simulated Illumina data from a recombinant population of Caenorhabditis elegans and show that the method can produce a high-fidelity, high-quality genome assembly for both parents of the cross.
|
17
|
Holland LZ. Genomics, evolution and development of amphioxus and tunicates: The Goldilocks principle. JOURNAL OF EXPERIMENTAL ZOOLOGY PART B-MOLECULAR AND DEVELOPMENTAL EVOLUTION 2014; 324:342-52. [DOI: 10.1002/jez.b.22569] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 11/25/2013] [Revised: 01/29/2014] [Accepted: 02/27/2014] [Indexed: 11/10/2022]
Affiliation(s)
- Linda Z. Holland
- Marine Biology Research Division; Scripps Institution of Oceanography; University of California San Diego; La Jolla California 92093-0202 USA
|
18
|
Parallel evolution of chordate cis-regulatory code for development. PLoS Genet 2013; 9:e1003904. [PMID: 24282393 PMCID: PMC3836708 DOI: 10.1371/journal.pgen.1003904] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 01/23/2013] [Accepted: 09/09/2013] [Indexed: 12/17/2022] Open
Abstract
Urochordates are the closest relatives of vertebrates and at the larval stage, possess a characteristic bilateral chordate body plan. In vertebrates, the genes that orchestrate embryonic patterning are in part regulated by highly conserved non-coding elements (CNEs), yet these elements have not been identified in urochordate genomes. Consequently the evolution of the cis-regulatory code for urochordate development remains largely uncharacterised. Here, we use genome-wide comparisons between C. intestinalis and C. savignyi to identify putative urochordate cis-regulatory sequences. Ciona conserved non-coding elements (ciCNEs) are associated with largely the same key regulatory genes as vertebrate CNEs. Furthermore, some of the tested ciCNEs are able to activate reporter gene expression in both zebrafish and Ciona embryos, in a pattern that at least partially overlaps that of the gene they associate with, despite the absence of sequence identity. We also show that the ability of a ciCNE to up-regulate gene expression in vertebrate embryos can in some cases be localised to short sub-sequences, suggesting that functional cross-talk may be defined by small regions of ancestral regulatory logic, although functional sub-sequences may also be dispersed across the whole element. We conclude that the structure and organisation of cis-regulatory modules is very different between vertebrates and urochordates, reflecting their separate evolutionary histories. However, functional cross-talk still exists because the same repertoire of transcription factors has likely guided their parallel evolution, exploiting similar sets of binding sites but in different combinations.
Vertebrates share many aspects of early development with our closest chordate ancestors, the tunicates. However, whilst the repertoire of genes that orchestrate development is essentially the same in the two lineages, the genomic code that regulates these genes appears to be very different, even though it is highly conserved within vertebrates themselves. Using comparative genomics, we have identified a parallel developmental code in tunicates and confirmed that this code, despite a lack of sequence conservation, associates with a similar repertoire of genes. However, the organisation of the code spatially is very different in the two lineages, strongly suggesting that most of it arose independently in vertebrates and tunicates, and in most cases lacking any direct sequence ancestry. We have assayed elements of the tunicate code, and found that at least some of them can regulate gene expression in zebrafish embryos. Our results suggest that regulatory code has arisen independently in different animal lineages but possesses some common functionality because its evolution has been driven by a similar cohort of developmental transcription factors. Our work helps illuminate how complex, stable gene regulatory networks evolve and become fixed within lineages.
|
19
|
Cutter AD, Jovelin R, Dey A. Molecular hyperdiversity and evolution in very large populations. Mol Ecol 2013; 22:2074-95. [PMID: 23506466 PMCID: PMC4065115 DOI: 10.1111/mec.12281] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Received: 10/26/2012] [Revised: 01/24/2013] [Accepted: 01/29/2013] [Indexed: 02/06/2023]
Abstract
The genomic density of sequence polymorphisms critically affects the sensitivity of inferences about ongoing sequence evolution, function and demographic history. Most animal and plant genomes have relatively low densities of polymorphisms, but some species are hyperdiverse with neutral nucleotide heterozygosity exceeding 5%. Eukaryotes with extremely large populations, mimicking bacterial and viral populations, present novel opportunities for studying molecular evolution in sexually reproducing taxa with complex development. In particular, hyperdiverse species can help answer controversial questions about the evolution of genome complexity, the limits of natural selection, modes of adaptation and subtleties of the mutation process. However, such systems have some inherent complications and here we identify topics in need of theoretical developments. Close relatives of the model organisms Caenorhabditis elegans and Drosophila melanogaster provide known examples of hyperdiverse eukaryotes, encouraging functional dissection of resulting molecular evolutionary patterns. We recommend how best to exploit hyperdiverse populations for analysis, for example, in quantifying the impact of noncrossover recombination in genomes and for determining the identity and micro-evolutionary selective pressures on noncoding regulatory elements.
Affiliation(s)
- Asher D Cutter
- Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada.
|
20
|
Abstract
BACKGROUND Haplotype information is useful for various genetic analyses, including genome-wide association studies. Determining haplotypes experimentally is difficult, and there are several computational approaches that infer haplotypes from genomic data. Among such approaches, single individual haplotyping or haplotype assembly, which infers two haplotypes of an individual from aligned sequence fragments, has been attracting considerable attention. To avoid incorrect results in downstream analyses, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. Although there are several efficient algorithms for solving haplotype assembly, there is no efficient method that allows the regions assembled with high confidence to be extracted. RESULTS We develop a probabilistic model, called MixSIH, for solving the haplotype assembly problem. The model has two mixture components representing two haplotypes. Based on the optimized model, a quality score is defined, which we call the 'minimum connectivity' (MC) score, for each segment in the haplotype assembly. Because existing accuracy measures for haplotype assembly are designed to compare the efficiency between the algorithms and are not suitable for evaluating the quality of the set of partially assembled haplotype segments, we develop an accuracy measure based on the pairwise consistency and evaluate the accuracy on simulated and real data. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We also show evidence that an existing experimental dataset contains chimeric read fragments derived from different haplotypes, which significantly degrade the quality of assembled haplotypes. CONCLUSIONS We develop a novel method for solving the haplotype assembly problem. We also define a quality score, based on our model, that indicates the accuracy of the haplotype segments. In our evaluation, MixSIH has successfully extracted reliable haplotype segments. The C++ source code of MixSIH is available at https://sites.google.com/site/hmatsu1226/software/mixsih.
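The idea of two mixture components, one per haplotype, can be illustrated with a minimal sketch. This is not the MixSIH model itself: it assumes equal mixing weights, known haplotypes, and a single independent per-allele error rate, and every name and value below is hypothetical. It computes the posterior probability that a fragment was drawn from the first haplotype.

```python
def fragment_likelihood(fragment, haplotype, error_rate):
    """P(fragment | haplotype), assuming independent allele errors.

    fragment maps a SNP index to the observed allele (0 or 1).
    """
    prob = 1.0
    for site, allele in fragment.items():
        prob *= (1.0 - error_rate) if haplotype[site] == allele else error_rate
    return prob


def responsibility(fragment, hap1, hap2, error_rate=0.1):
    """Posterior probability that the fragment comes from hap1 (equal priors)."""
    l1 = fragment_likelihood(fragment, hap1, error_rate)
    l2 = fragment_likelihood(fragment, hap2, error_rate)
    return l1 / (l1 + l2)


# Hypothetical complementary haplotypes over four heterozygous sites.
hap1 = [0, 1, 0, 1]
hap2 = [1, 0, 1, 0]

clean = {0: 0, 1: 1, 2: 0}           # agrees with hap1 at every covered site
chimeric = {0: 0, 1: 1, 2: 1, 3: 0}  # left half matches hap1, right half hap2

r_clean = responsibility(clean, hap1, hap2)
r_chimeric = responsibility(chimeric, hap1, hap2)
print(round(r_clean, 4), round(r_chimeric, 4))
```

Under these assumptions a consistent fragment is assigned to one haplotype with high confidence, while a chimeric fragment is split evenly between the two, which gives some intuition for why such fragments degrade the assembled haplotypes.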
Affiliation(s)
- Hirotaka Matsumoto
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan.
|
21
|
Stolfi A, Christiaen L. Genetic and genomic toolbox of the chordate Ciona intestinalis. Genetics 2012; 192:55-66. [PMID: 22964837 PMCID: PMC3430545 DOI: 10.1534/genetics.112.140590] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Received: 03/18/2012] [Accepted: 04/30/2012] [Indexed: 02/01/2023] Open
Abstract
The experimental malleability and unique phylogenetic position of the sea squirt Ciona intestinalis as part of the sister group to the vertebrates have helped establish these marine chordates as model organisms for the study of developmental genetics and evolution. Here we summarize the tools, techniques, and resources available to the Ciona geneticist, citing examples of studies that employed such strategies in the elucidation of gene function in Ciona. Genetic screens, germline transgenesis, electroporation of plasmid DNA, and microinjection of morpholinos are all routinely employed, and in the near future we expect these to be complemented by targeted mutagenesis, homologous recombination, and RNAi. The genomic resources available will continue to support the design and interpretation of genetic experiments and allow for increasingly sophisticated approaches on a high-throughput, whole-genome scale.
Affiliation(s)
- Alberto Stolfi
- Center for Developmental Genetics, Department of Biology, New York University, New York, New York 10003, USA.
|
22
|
Beaster-Jones L. Cis-regulation and conserved non-coding elements in amphioxus. Brief Funct Genomics 2012; 11:118-30. [DOI: 10.1093/bfgp/els006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
|
23
|
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 2012; 44:226-32. [PMID: 22231483 PMCID: PMC3272472 DOI: 10.1038/ng.1028] [Citation(s) in RCA: 352] [Impact Index Per Article: 29.3] [Received: 04/08/2011] [Accepted: 11/07/2011] [Indexed: 12/24/2022]
Abstract
Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
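As a rough illustration of what "colored" means here, the sketch below tags every k-mer with the set of samples (colors) in which it occurs and then lists the k-mers seen only in one sample. It is a toy under stated assumptions, not the Cortex implementation: the sample names, sequences, and the function `colored_kmers` are hypothetical, and graph edges, coverage, and error handling are omitted.

```python
from collections import defaultdict


def colored_kmers(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it."""
    colors = defaultdict(set)
    for name, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return colors


# Hypothetical data: a reference and one sample carrying a single-base variant.
samples = {
    "reference": ["ACGTACCTGA"],
    "sample": ["ACGTTCCTGA"],
}

colors = colored_kmers(samples, k=4)

# k-mers carrying only the "sample" color mark sequence absent from the reference.
novel = sorted(kmer for kmer, c in colors.items() if c == {"sample"})
print(novel)
```

Keeping the colors on shared k-mers as well is what would let several genomes be represented in one structure and compared without aligning each to a reference first.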
Affiliation(s)
- Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK
- Mario Caccamo
- The Genome Analysis Centre, Norwich Research Park, Norwich, NR4 7UH, UK
- Isaac Turner
- Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK
- Gil McVean
- Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
- Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK
|
24
|
Carnevali P, Baccash J, Halpern AL, Nazarenko I, Nilsen GB, Pant KP, Ebert JC, Brownley A, Morenzoni M, Karpinchyk V, Martin B, Ballinger DG, Drmanac R. Computational techniques for human genome resequencing using mated gapped reads. J Comput Biol 2011; 19:279-92. [PMID: 22175250 DOI: 10.1089/cmb.2011.0201] [Citation(s) in RCA: 85] [Impact Index Per Article: 6.5] [Indexed: 11/12/2022] Open
Abstract
Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.
Affiliation(s)
- Paolo Carnevali
- Complete Genomics Inc., Mountain View, California 94043, USA
|
25
|
Kim JH, Kim WC, Li LM, Park S. HapEdit: an accuracy assessment viewer for haplotype assembly using massively parallel DNA-sequencing technologies. Nucleic Acids Res 2011; 39:W557-61. [PMID: 21576217 PMCID: PMC3125762 DOI: 10.1093/nar/gkr354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Massively parallel sequencing technologies have recently flourished and dramatically cut the cost of sequencing personal human genomes. Haplotype assembly from personal genomes sequenced using these technologies is becoming a cost-effective and promising tool for human disease study. Computational assembly of haplotypes has proven to be very accurate, but the results still contain errors. Here we present a tool, HapEdit, to assess the accuracy of assembled haplotypes and edit them manually. Using this tool, a user can break erroneous haplotype segments into smaller segments, or concatenate haplotype segments if the concatenated haplotype segments are sufficiently supported. A user can also edit bases with low-quality scores. HapEdit displays haplotype assemblies so that a user can easily navigate and pinpoint a region of interest. As inputs, HapEdit currently takes reads from the Polonator, Illumina, SOLiD, 454 and Sanger sequencing technologies.
Affiliation(s)
- Jong Hyun Kim
- Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
|
26
|
Abstract
Contemporary sequencing studies often ignore the diploid nature of the human genome because they do not routinely separate or 'phase' maternally and paternally derived sequence information. However, many findings - both from recent studies and in the more established medical genetics literature - indicate that relationships between human DNA sequence and phenotype, including disease, can be more fully understood with phase information. Thus, the existing technological impediments to obtaining phase information must be overcome if human genomics is to reach its full potential.
|
27
|
Abstract
Ascidians, such as Ciona, are invertebrate chordates with simple embryonic body plans and small, relatively non-redundant genomes. Ciona genetics is in its infancy compared to many other model systems, but it provides a powerful method for studying this important vertebrate outgroup. Here we give basic methods for genetic analysis of Ciona, including protocols for controlled crosses both by natural spawning and by the surgical isolation of gametes; the identification and propagation of mutant lines; and strategies for positional cloning.
Affiliation(s)
- Michael T Veeman
- Department of Molecular, Cell and Developmental Biology, University of California Santa Barbara, Santa Barbara, CA, USA.
|
28
|
Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat Biotechnol 2010; 29:59-63. [PMID: 21170042 DOI: 10.1038/nbt.1740] [Citation(s) in RCA: 184] [Impact Index Per Article: 13.1] [Received: 10/26/2010] [Accepted: 11/29/2010] [Indexed: 11/08/2022]
Abstract
Haplotype information is essential to the complete description and interpretation of genomes, genetic diversity and genetic ancestry. Although individual human genome sequencing is increasingly routine, nearly all such genomes are unresolved with respect to haplotype. Here we combine the throughput of massively parallel sequencing with the contiguity information provided by large-insert cloning to experimentally determine the haplotype-resolved genome of a South Asian individual. A single fosmid library was split into a modest number of pools, each providing ∼3% physical coverage of the diploid genome. Sequencing of each pool yielded reads overwhelmingly derived from only one homologous chromosome at any given location. These data were combined with whole-genome shotgun sequence to directly phase 94% of ascertained heterozygous single nucleotide polymorphisms (SNPs) into long haplotype blocks (N50 of 386 kilobases (kbp)). This method also facilitates the analysis of structural variation, for example, to anchor novel insertions to specific locations and haplotypes.
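The block-length statistic quoted in this abstract (N50) can be computed as in the sketch below. It uses the common definition: the largest length L such that blocks of length at least L cover at least half of the total. The function name and the example lengths are hypothetical and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of block (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest block down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical haplotype block lengths in kilobases.
blocks_kb = [500, 400, 300, 200, 100]
result = n50(blocks_kb)
print(result)
```

For these example lengths the total is 1500, and the two longest blocks together (900) are the first to reach half of it, so the function returns the second-longest length.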
|
29
|
Tsagkogeorga G, Turon X, Galtier N, Douzery EJP, Delsuc F. Accelerated evolutionary rate of housekeeping genes in tunicates. J Mol Evol 2010; 71:153-67. [PMID: 20697701 DOI: 10.1007/s00239-010-9372-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 03/10/2010] [Accepted: 07/16/2010] [Indexed: 01/11/2023]
Abstract
Phylogenomics has recently revealed that tunicates represent the sister-group of vertebrates in the newly defined clade Olfactores. However, phylogenomic and comparative genomic studies have also suggested that tunicates are characterized by an elevated rate of molecular evolution and a high degree of genomic divergence. Despite the recurrent interest in the group, the picture of the peculiar evolutionary dynamics of tunicates is still fragmentary, as it rests mainly on studies focusing on only a few model species. In order to expand the available genomic data for the group, we used the high-throughput 454 technology to sequence the partial transcriptome of a previously unsampled tunicate, Microcosmus squamiger. This allowed us to gain further insights into the accelerated evolution of tunicates through a comparative analysis based on pertinent phylogenetic markers, i.e., a core of 35 housekeeping genes conserved across bilaterians. Our results showed that tunicates evolved on average about two times faster than the other chordates, yet the degree of this acceleration varied extensively across genes and lineages. Appendicularia and Aplousobranchia were detected as the most divergent groups, which were also characterized by highly heterogeneous substitution rates across genes. Finally, an estimation of the dN/dS ratio in three pairs of closely related taxa within Olfactores did not reveal strong differences between the tunicate and vertebrate lineages, suggesting that for this set of housekeeping genes, the accelerated evolution of tunicates is plausibly due to an elevated mutation rate rather than to particular selective effects.
Affiliation(s)
- Georgia Tsagkogeorga
- Université Montpellier 2 and CNRS, Institut des Sciences de l'Evolution (UMR 5554), CC064, Place Eugène Bataillon, 34095, Montpellier Cedex 05, France
|
30
|
Smith JJ, Saha NR, Amemiya CT. Genome biology of the cyclostomes and insights into the evolutionary biology of vertebrate genomes. Integr Comp Biol 2010; 50:130-7. [PMID: 21558194 PMCID: PMC3140258 DOI: 10.1093/icb/icq023] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 01/27/2023] Open
Abstract
The jawless vertebrates (lamprey and hagfish) are the closest extant outgroups to all jawed vertebrates (gnathostomes) and can therefore provide critical insight into the evolution and basic biology of vertebrate genomes. As such, it is notable that the genomes of lamprey and hagfish possess a capacity for rearrangement that is beyond anything known from the gnathostomes. Like the jawed vertebrates, lamprey and hagfish undergo rearrangement of adaptive immune receptors. However, the receptors and the mechanisms for rearrangement that are utilized by jawless vertebrates clearly evolved independently of the gnathostome system. Unlike the jawed vertebrates, lamprey and hagfish also undergo extensive programmed rearrangements of the genome during embryonic development. By considering these fascinating genome biologies in the context of proposed (albeit contentious) phylogenetic relationships among lamprey, hagfish, and gnathostomes, we can begin to understand the evolutionary history of the vertebrate genome. Specifically, the deep shared ancestry and rapid divergence of lampreys, hagfish and gnathostomes is considered evidence that the two versions of programmed rearrangement present in lamprey and hagfish (embryonic and immune receptor) were present in an ancestral lineage that existed more than 400 million years ago and perhaps included the ancestor of the jawed vertebrates. Validating this premise will require better characterization of the genome sequence and mechanisms of rearrangement in lamprey and hagfish.
Affiliation(s)
- J J Smith
- Benaroya Research Institute at Virginia Mason, 1201 9th Avenue, Seattle, WA 98101, USA.
|
31
|
Haubold B, Pfaffelhuber P, Lynch M. mlRho - a program for estimating the population mutation and recombination rates from shotgun-sequenced diploid genomes. Mol Ecol 2010; 19 Suppl 1:277-84. [PMID: 20331786 DOI: 10.1111/j.1365-294x.2009.04482.x] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Indexed: 11/26/2022]
Abstract
Improvements in sequencing technology over the past 5 years are leading to routine application of shotgun sequencing in the fields of ecology and evolution. However, the theory to estimate evolutionary parameters from these data is still being worked out. Here we present an extension and implementation of part of this theory, mlRho. This program can efficiently compute the following three maximum likelihood estimators based on shotgun sequence data obtained from single diploid individuals: the population mutation rate (4N_eμ), the sequencing error rate, and the population recombination rate (4N_ec). We demonstrate the accuracy of mlRho by applying it to simulated data sets. In addition, we analyse the genomes of the sea squirt Ciona intestinalis and the water flea Daphnia pulex. Ciona intestinalis is an obligate outcrosser, while D. pulex is a cyclic parthenogen, and we discuss how these contrasting life histories are reflected in our parameter estimates. The program mlRho is freely available from http://guanine.evolbio.mpg.de/mlRho.
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany.
|
32
|
Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol 2010; 11:R28. [PMID: 20219098 PMCID: PMC2864568 DOI: 10.1186/gb-2010-11-3-r28] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Received: 10/12/2009] [Revised: 12/11/2009] [Accepted: 03/10/2010] [Indexed: 11/23/2022] Open
Abstract
A method for determining false segmental duplications in vertebrate genomes, thus correcting mis-assemblies and providing more accurate estimates of duplications. Diploid genomes with divergent chromosomes present special problems for assembly software as two copies of especially polymorphic regions may be mistakenly constructed, creating the appearance of a recent segmental duplication. We developed a method for identifying such false duplications and applied it to four vertebrate genomes. For each genome, we corrected mis-assemblies, improved estimates of the amount of duplicated sequence, and recovered polymorphisms between the sequenced chromosomes.
|
33
|
Frith MC, Wan R, Horton P. Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res 2010; 38:e100. [PMID: 20110255 PMCID: PMC2853142 DOI: 10.1093/nar/gkq010] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Indexed: 11/13/2022] Open
Abstract
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan.
|
34
|
Vavouri T, Lehner B. Conserved noncoding elements and the evolution of animal body plans. Bioessays 2009; 31:727-35. [PMID: 19492354 DOI: 10.1002/bies.200900014] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 01/29/2023]
Abstract
The genomes of vertebrates, flies, and nematodes contain highly conserved noncoding elements (CNEs). CNEs cluster around genes that regulate development, and where tested, they can act as transcriptional enhancers. Within an animal group CNEs are the most conserved sequences but between groups they are normally diverged beyond recognition. Alternative CNEs are, however, associated with an overlapping set of genes that control development in all animals. Here, we discuss the evidence that CNEs are part of the core gene regulatory networks (GRNs) that specify alternative animal body plans. The major animal groups arose >550 million years ago. We propose that the cis-regulatory inputs identified by CNEs arose during the "re-wiring" of regulatory interactions that occurred during early animal evolution. Consequently, different animal groups, with different core GRNs, contain alternative sets of CNEs. Due to the subsequent stability of animal body plans, these core regulatory sequences have been evolving in parallel under strong purifying selection in different animal groups.
Affiliation(s)
- Tanya Vavouri
- EMBL-CRG Systems Biology Research Unit, Dr. Aiguader 88, Barcelona, Spain.
|
35
|
Abstract
Summary: We present a program to improve haplotype reconstruction by incorporating information from paired-end reads, and demonstrate its utility on simulated data. We find that given a fixed coverage, longer reads (implying fewer of them) are preferable. Availability: The executable and user manual can be freely downloaded from ftp://ftp.sanger.ac.uk/pub/zn1/HI. Contact: ql2@sanger.ac.uk
Affiliation(s)
- Quan Long
- The Wellcome Trust Sanger Institute, Hinxton, Cambs, UK.
|
36
|
Kim JH, Kim WC, Waterman MS, Park S, Li LM. HAPLOWSER: a whole-genome haplotype browser for personal genome and metagenome. Bioinformatics 2009; 25:2430-1. [PMID: 19561337 DOI: 10.1093/bioinformatics/btp399] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
SUMMARY Haplotype assembly is becoming a very important tool in genome sequencing of human and other organisms. Although haplotypes were previously inferred from genome assemblies, there has never been a comparative haplotype browser that depicts a global picture of whole-genome alignments among haplotypes of different organisms. We introduce a whole-genome HAPLotype brOWSER (HAPLOWSER), providing evolutionary perspectives from multiple aligned haplotypes and functional annotations. Haplowser enables the comparison of haplotypes from metagenomes, and associates conserved regions or the bases at the conserved regions with functional annotations and custom tracks. The associations are quantified for further analysis and presented as pie charts. Functional annotations and custom tracks that are projected onto haplotypes are saved as multiple files in FASTA format. Haplowser provides a user-friendly interface, and can display alignments of haplotypes with functional annotations at any resolution. AVAILABILITY Haplowser, written in Java, supports multiple platforms including Windows and Linux. Haplowser is publicly available at http://embio.yonsei.ac.kr/haplowser.
Affiliation(s)
- Jong Hyun Kim
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
|
37
|
Barrière A, Yang SP, Pekarek E, Thomas CG, Haag ES, Ruvinsky I. Detecting heterozygosity in shotgun genome assemblies: Lessons from obligately outcrossing nematodes. Genome Res 2009; 19:470-80. [PMID: 19204328 DOI: 10.1101/gr.081851.108] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.5] [Indexed: 11/24/2022]
Abstract
The majority of nematodes are gonochoristic (dioecious) with distinct male and female sexes, but the best-studied species, Caenorhabditis elegans, is a self-fertile hermaphrodite. The sequencing of the genomes of C. elegans and a second hermaphrodite, C. briggsae, was facilitated in part by the low amount of natural heterozygosity, which typifies selfing species. Ongoing genome projects for gonochoristic Caenorhabditis species seek to approximate this condition by intense inbreeding prior to sequencing. Here we show that despite this inbreeding, the heterozygous fraction of the whole genome shotgun assemblies of three gonochoristic Caenorhabditis species, C. brenneri, C. remanei, and C. japonica, is considerable. We first demonstrate experimentally that independently assembled sequence variants in C. remanei and C. brenneri are allelic. We then present gene-based approaches for recognizing heterozygous regions of WGS assemblies. We also develop a simple method for quantifying heterozygosity that can be applied to assemblies lacking gene annotations. Consistently we find that approximately 10% and 30% of the C. remanei and C. brenneri genomes, respectively, are represented by two alleles in the assemblies. Heterozygosity is restricted to autosomes and its retention is accompanied by substantial inbreeding depression, suggesting that it is caused by multiple recessive deleterious alleles and not merely by chance. Both the overall amount and chromosomal distribution of heterozygous DNA is highly variable between assemblies of close relatives produced by identical methodologies, and allele frequencies have continued to change after strains were sequenced. Our results highlight the impact of mating systems on genome sequencing projects.
Affiliation(s)
- Antoine Barrière
- Department of Ecology and Evolution and Institute for Genomics and Systems Biology, The University of Chicago, Chicago, Illinois 60637, USA
|
38
|
Abstract
In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from (http://www.cse.ucsd.edu/users/vibansal/HASH/).
|
39
|
Bansal V, Bafna V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 2008; 24:i153-9. [DOI: 10.1093/bioinformatics/btn298] [Citation(s) in RCA: 225] [Impact Index Per Article: 14.1] [Indexed: 11/13/2022] Open
|
40
|
Thakur NL, Jain R, Natalio F, Hamer B, Thakur AN, Müller WE. Marine molecular biology: An emerging field of biological sciences. Biotechnol Adv 2008; 26:233-45. [DOI: 10.1016/j.biotechadv.2008.01.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 11/11/2007] [Revised: 01/03/2008] [Accepted: 01/03/2008] [Indexed: 12/17/2022]
|
41
|
Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AWC, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC. The diploid genome sequence of an individual human. PLoS Biol 2008; 5:e254. [PMID: 17803354 PMCID: PMC1964779 DOI: 10.1371/journal.pbio.0050254] [Citation(s) in RCA: 1117] [Impact Index Per Article: 69.8] [Received: 05/09/2007] [Accepted: 07/30/2007] [Indexed: 01/20/2023] Open
Abstract
Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels) (1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
Affiliation(s)
- Samuel Levy
- J. Craig Venter Institute, Rockville, Maryland, USA.
|
42
|
Harada Y, Takagaki Y, Sunagawa M, Saito T, Yamada L, Taniguchi H, Shoguchi E, Sawada H. Mechanism of self-sterility in a hermaphroditic chordate. Science 2008; 320:548-50. [PMID: 18356489 DOI: 10.1126/science.1152488] [Citation(s) in RCA: 115] [Impact Index Per Article: 7.2] [Indexed: 12/19/2022]
Abstract
Hermaphroditic organisms avoid inbreeding by a system of self-incompatibility (SI). A primitive chordate (ascidian) Ciona intestinalis is an example of such an organism, but the molecular mechanism underlying its SI system is not known. Here, we show that the SI system is governed by two gene loci that act cooperatively. Each locus contains a tightly linked pair of polycystin 1-related receptor (s-Themis) and fibrinogen-like ligand (v-Themis) genes, the latter of which is located in the first intron of s-Themis but transcribed in the opposite direction. These genes may encode male- and female-side self-recognition molecules. The SI system of C. intestinalis has a similar framework to that of flowering plants but utilizing different molecules.
Affiliation(s)
- Yoshito Harada
- Sugashima Marine Biological Laboratory, Graduate School of Science, Nagoya University, Sugashima, Toba 517-0004, Japan.
|
43
|
Denisov G, Walenz B, Halpern AL, Miller J, Axelrod N, Levy S, Sutton G. Consensus generation and variant detection by Celera Assembler. Bioinformatics 2008; 24:1035-40. [PMID: 18321888 DOI: 10.1093/bioinformatics/btn074] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.1] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. RESULTS Being applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the total number 2,033,311 detected regions of sequence variation. In 33,269 out of 460,373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1,506,344 known SNPs, it detects 438,814 new heterozygous SNPs with false positive rate 12%. AVAILABILITY The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/
Affiliation(s)
- Gennady Denisov
- J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA.
|