1
Poszewiecka B, Gogolewski K, Karolak JA, Stankiewicz P, Gambin A. PhaseDancer: a novel targeted assembler of segmental duplications unravels the complexity of the human chromosome 2 fusion going from 48 to 46 chromosomes in hominin evolution. Genome Biol 2023; 24:205. [PMID: 37697406 PMCID: PMC10496407 DOI: 10.1186/s13059-023-03022-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Accepted: 07/25/2023] [Indexed: 09/13/2023] Open
Abstract
Resolving complex genomic regions rich in segmental duplications (SDs) is challenging due to the high error rate of long-read sequencing. Here, we describe a targeted approach with a novel genome assembler, PhaseDancer, that iteratively extends SD-rich regions of interest. We validate its robustness and efficiency using a gold-standard set of human BAC clones and in silico-generated SDs with predefined evolutionary scenarios. PhaseDancer enables extension of the incomplete complex SD-rich subtelomeric regions of Great Ape chromosomes orthologous to the human chromosome 2 (HSA2) fusion site, informing a model of HSA2 formation and unravelling the evolution of human and Great Ape genomes.
Affiliation(s)
- Barbara Poszewiecka
  - Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
- Krzysztof Gogolewski
  - Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
- Justyna A. Karolak
  - Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, 77030 Houston, TX USA
  - Chair and Department of Genetics and Pharmaceutical Microbiology, Poznan University of Medical Sciences, 60-806 Poznan, Poland
- Paweł Stankiewicz
  - Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, 77030 Houston, TX USA
- Anna Gambin
  - Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
2
Saada OA, Friedrich A, Schacherer J. Towards accurate, contiguous and complete alignment-based polyploid phasing algorithms. Genomics 2022; 114:110369. [PMID: 35483655 DOI: 10.1016/j.ygeno.2022.110369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2021] [Revised: 03/09/2022] [Accepted: 04/11/2022] [Indexed: 01/14/2023]
Abstract
Phasing, and in particular polyploid phasing, has been a challenging problem, held back both by the limited read length of high-throughput short-read sequencing, which cannot span the distance between heterozygous sites, and by the high labor and cost of alternative methods such as the physical separation of chromosomes. Recently developed single-molecule long-read sequencing methods provide much longer reads, which overcome this limitation. Here we review the alignment-based methods of polyploid phasing, which rely on four main strategies: population inference methods, which leverage the genetic information of several individuals to phase a sample; objective function minimization methods, which minimize a function such as the Minimum Error Correction (MEC); graph partitioning methods, which represent the read data as a graph and split it into k haplotype subgraphs; and cluster building methods, which iteratively grow clusters of similar reads into a final set of clusters that represent the haplotypes. We discuss the advantages and limitations of these methods and the metrics used to assess their performance, proposing that accuracy and contiguity are the most meaningful metrics. Finally, we propose that the field of alignment-based polyploid phasing would greatly benefit from a well-designed benchmarking dataset with appropriate evaluation metrics. We consider that significant improvements can still be achieved to obtain more accurate and contiguous polyploid phasing results that reflect the complexity of polyploid genome architectures.
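As an illustration of the objective-function strategy named in this abstract, the following is a minimal sketch of how a Minimum Error Correction (MEC) score might be computed for a candidate set of k haplotypes. The data layout (reads as dictionaries from variant-site index to allele) and the names `hamming` and `mec_score` are assumptions made for this example; they are not taken from the reviewed paper or from any specific phasing tool.

```python
from typing import Dict, List

# Assumed representation: a read maps variant-site index -> observed allele.
Read = Dict[int, int]


def hamming(read: Read, haplotype: List[int]) -> int:
    """Count covered sites where the read's allele differs from the haplotype."""
    return sum(1 for site, allele in read.items() if haplotype[site] != allele)


def mec_score(reads: List[Read], haplotypes: List[List[int]]) -> int:
    """Sum over reads of the mismatches to each read's best-matching haplotype."""
    return sum(min(hamming(r, h) for h in haplotypes) for r in reads)


# Toy data: four reads over three variant sites, two candidate haplotypes.
reads = [{0: 0, 1: 0}, {1: 0, 2: 1}, {0: 1, 1: 1, 2: 0}, {1: 1, 2: 1}]
haplotypes = [[0, 0, 1], [1, 1, 0]]
print(mec_score(reads, haplotypes))
```

A phasing method of this family would search for the haplotype set that minimizes this score; the sketch only evaluates a given candidate and works unchanged for more than two haplotypes.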
Affiliation(s)
- Omar Abou Saada
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Anne Friedrich
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Joseph Schacherer
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
  - Institut Universitaire de France (IUF), Paris, France
3
Warren WC, Harris RA, Haukness M, Fiddes IT, Murali SC, Fernandes J, Dishuck PC, Storer JM, Raveendran M, Hillier LW, Porubsky D, Mao Y, Gordon D, Vollger MR, Lewis AP, Munson KM, DeVogelaere E, Armstrong J, Diekhans M, Walker JA, Tomlinson C, Graves-Lindsay TA, Kremitzki M, Salama SR, Audano PA, Escalona M, Maurer NW, Antonacci F, Mercuri L, Maggiolini FAM, Catacchio CR, Underwood JG, O'Connor DH, Sanders AD, Korbel JO, Ferguson B, Kubisch HM, Picker L, Kalin NH, Rosene D, Levine J, Abbott DH, Gray SB, Sanchez MM, Kovacs-Balint ZA, Kemnitz JW, Thomasy SM, Roberts JA, Kinnally EL, Capitanio JP, Skene JHP, Platt M, Cole SA, Green RE, Ventura M, Wiseman RW, Paten B, Batzer MA, Rogers J, Eichler EE. Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility. Science 2021; 370:eabc6617. [PMID: 33335035 DOI: 10.1126/science.abc6617] [Citation(s) in RCA: 73] [Impact Index Per Article: 24.3] [Received: 05/07/2020] [Accepted: 10/29/2020] [Indexed: 12/15/2022]
Abstract
The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.
Affiliation(s)
- Wesley C Warren
  - Department of Animal Sciences, Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
  - Department of Surgery, School of Medicine, University of Missouri, Columbia, MO 65211, USA
  - Institute of Data Science and Informatics, University of Missouri, Columbia, MO 65211, USA
- R Alan Harris
  - Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Marina Haukness
  - Computational Genomics Laboratory, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Shwetha C Murali
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
- Jason Fernandes
  - Department of Biomolecular Engineering, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Philip C Dishuck
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Jessica M Storer
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Muthuswamy Raveendran
  - Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- LaDeana W Hillier
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Yafei Mao
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- David Gordon
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
- Mitchell R Vollger
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Alexandra P Lewis
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Katherine M Munson
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Elizabeth DeVogelaere
  - Computational Genomics Laboratory, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Joel Armstrong
  - Computational Genomics Laboratory, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Mark Diekhans
  - Computational Genomics Laboratory, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Jerilyn A Walker
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Chad Tomlinson
  - McDonnell Genome Institute, Washington University, St. Louis, MO 63108, USA
- Milinn Kremitzki
  - McDonnell Genome Institute, Washington University, St. Louis, MO 63108, USA
- Sofie R Salama
  - Department of Biomolecular Engineering, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Peter A Audano
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Merly Escalona
  - Department of Biomolecular Engineering, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Nicholas W Maurer
  - Department of Biomolecular Engineering, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Ludovica Mercuri
  - Department of Biology, University of Bari 'Aldo Moro', 70125 Bari, Italy
- David H O'Connor
  - Department of Pathology and Laboratory Medicine, Wisconsin National Primate Research Center, University of Wisconsin-Madison, Madison, WI 53711, USA
- Ashley D Sanders
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- Jan O Korbel
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- Betsy Ferguson
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Beaverton, OR 97006, USA
- Louis Picker
  - Oregon National Primate Research Center and Vaccine and Gene Therapy Institute, Oregon Health Sciences University, Beaverton, OR 97006, USA
- Ned H Kalin
  - Department of Psychiatry, University of Wisconsin School of Medicine and Public Health, Madison, WI 53719, USA
- Douglas Rosene
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA 02118, USA
- Jon Levine
  - Department of Neuroscience, University of Wisconsin, Madison, WI 53175, USA
  - Wisconsin National Primate Research Center, University of Wisconsin, Madison, WI 53171, USA
- David H Abbott
  - Wisconsin National Primate Research Center, University of Wisconsin, Madison, WI 53171, USA
  - Department of Obstetrics and Gynecology, Wisconsin National Primate Research Center, University of Wisconsin, Madison, WI 53715, USA
- Stanton B Gray
  - The University of Texas MD Anderson Cancer Center, Michale E. Keeling Center for Comparative Medicine and Research, Bastrop, TX 78602, USA
- Mar M Sanchez
  - Yerkes National Primate Research Center, Atlanta, GA 30329, USA
  - Department of Psychiatry and Behavioral Sciences, Emory University School of Medicine, Atlanta, GA 30329, USA
- Joseph W Kemnitz
  - Wisconsin National Primate Research Center, University of Wisconsin, Madison, WI 53171, USA
  - Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI 53706, USA
- Sara M Thomasy
  - Department of Surgical and Radiological Sciences, School of Veterinary Medicine, University of California-Davis, Davis, CA 95616, USA
  - Department of Ophthalmology and Vision Science, School of Medicine, University of California-Davis, Davis, CA 95817, USA
- Erin L Kinnally
  - California National Primate Research Center, Davis, CA 95616, USA
  - Department of Psychology, University of California, Davis, CA 95616, USA
- John P Capitanio
  - California National Primate Research Center, Davis, CA 95616, USA
  - Department of Psychology, University of California, Davis, CA 95616, USA
- J H Pate Skene
  - Department of Neurobiology, Duke University School of Medicine, Durham, NC 27710, USA
- Michael Platt
  - Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Shelley A Cole
  - Population Health Program, Texas Biomedical Research Institute and Southwest National Primate Research Center, San Antonio, TX 78227, USA
- Richard E Green
  - Department of Biomolecular Engineering, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Mario Ventura
  - Department of Biology, University of Bari 'Aldo Moro', 70125 Bari, Italy
- Roger W Wiseman
  - Department of Pathology and Laboratory Medicine, Wisconsin National Primate Research Center, University of Wisconsin-Madison, Madison, WI 53711, USA
- Benedict Paten
  - Computational Genomics Laboratory, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Mark A Batzer
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Jeffrey Rogers
  - Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
4
Joshi D, Mao S, Kannan S, Diggavi S. QAlign: aligning nanopore reads accurately using current-level modeling. Bioinformatics 2020; 37:625-633. [PMID: 33051648 PMCID: PMC8097683 DOI: 10.1093/bioinformatics/btaa875] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/10/2020] [Revised: 07/17/2020] [Accepted: 09/29/2020] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this article, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. RESULTS: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2, 2.5 and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/joshidhaivat/QAlign.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dhaivat Joshi
  - Electrical & Computer Engineering, University of California, Los Angeles, CA 90095, USA
- Shunfu Mao
  - Electrical & Computer Engineering, University of Washington, Seattle, WA 98195, USA
- Sreeram Kannan
  - Electrical & Computer Engineering, University of Washington, Seattle, WA 98195, USA. To whom correspondence should be addressed.
- Suhas Diggavi
  - Electrical & Computer Engineering, University of California, Los Angeles, CA 90095, USA. To whom correspondence should be addressed.
5
Schrinner SD, Mari RS, Ebler J, Rautiainen M, Seillier L, Reimer JJ, Usadel B, Marschall T, Klau GW. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol 2020; 21:252. [PMID: 32951599 PMCID: PMC7504856 DOI: 10.1186/s13059-020-02158-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 02/03/2020] [Accepted: 08/26/2020] [Indexed: 01/19/2023] Open
Abstract
Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WHATSHAP POLYPHASE, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state-of-the-art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap.
Affiliation(s)
- Sven D Schrinner
  - Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
- Rebecca Serra Mari
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
- Mikko Rautiainen
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany
  - Max Planck Institute for Informatics, Saarbrücken, 66123, Germany
- Lancelot Seillier
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
- Julia J Reimer
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
- Björn Usadel
  - Forschungszentrum Jülich IBG-4, Wilhelm-Johnen-Str., Jülich, 52428, Germany
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
  - Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
- Gunnar W Klau
  - Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
  - Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
6
Sankararaman A, Vikalo H, Baccelli F. ComHapDet: a spatial community detection algorithm for haplotype assembly. BMC Genomics 2020; 21:586. [PMID: 32900369 PMCID: PMC7488034 DOI: 10.1186/s12864-020-06935-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. RESULTS: We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes which allows both biallelic and multi-allelic variants. CONCLUSIONS: Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
Affiliation(s)
- Abishek Sankararaman
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Haris Vikalo
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- François Baccelli
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
  - Department of Mathematics, The University of Texas at Austin, Austin, TX, USA
7
Shen F, Kidd JM. Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2. Genes (Basel) 2020; 11:141. [PMID: 32013076 PMCID: PMC7073954 DOI: 10.3390/genes11020141] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/11/2020] [Revised: 01/21/2020] [Accepted: 01/24/2020] [Indexed: 12/22/2022] Open
Abstract
Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy-number variation affecting multiple genes at the APOBEC3 locus.
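The mapping-free idea in this abstract, tabulating unique k-mers and reading copy number from their depth, can be illustrated with a toy sketch. This is not the QuicK-mer2 implementation: the function names, the normalization against a control region assumed to have two copies, and the omission of reverse complements and of any depth correction are all simplifying assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(reads: Iterable[str], k: int, targets: Set[str]) -> Counter:
    """Tabulate how often each target k-mer occurs across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in targets:
                counts[kmer] += 1
    return counts


def estimate_copy_number(counts: Counter, region_kmers: Set[str],
                         control_kmers: Set[str], control_copies: int = 2) -> float:
    """Scale mean k-mer depth in a region by the depth of a control region."""
    control_depth = sum(counts[k] for k in control_kmers) / len(control_kmers)
    region_depth = sum(counts[k] for k in region_kmers) / len(region_kmers)
    return control_copies * region_depth / control_depth


# Toy data: k-mers unique to a control region and to one paralog (hypothetical).
control = {"ACGT", "CGTA"}
paralog = {"TTGC", "TGCA"}
reads = ["ACGTAC", "ACGTAC", "TTGCAA", "TTGCAA", "TTGCAA", "TTGCAA"]
counts = count_kmers(reads, 4, control | paralog)
print(estimate_copy_number(counts, paralog, control))
```

Restricting the tally to k-mers that occur in only one paralog is what makes the estimate paralog-specific; a k-mer shared between copies would mix their depths.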
Affiliation(s)
- Feichen Shen
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Jeffrey M. Kidd
  - Department of Human Genetics and Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
8
Mazrouee S, Wang W. PolyCluster: Minimum Fragment Disagreement Clustering for Polyploid Phasing. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:264-277. [PMID: 30040655 DOI: 10.1109/tcbb.2018.2858803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
Phasing is an emerging area in computational biology with important applications in clinical decision making and biomedical sciences. While machine learning techniques have shown tremendous potential in many biomedical applications, their utility in phasing has not yet been fully understood. In this paper, we investigate the development of clustering-based techniques for phasing in polyploid organisms, where more than two copies of each chromosome exist in the cells of the organism under study. We develop a novel framework, called PolyCluster, based on the concept of correlation clustering followed by an effective cluster merging mechanism to minimize the amount of disagreement among short reads residing in each cluster. We first introduce a graph model to quantify the amount of similarity between each pair of DNA reads. We then present a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. Our extensive analysis demonstrates the effectiveness of PolyCluster in accurate and scalable phasing. In particular, we show that PolyCluster reduces the switching error of H-PoP, HapColor, and HapTree by 44.4, 51.2, and 48.3 percent, respectively. Also, the running time of PolyCluster is several orders of magnitude less than that of HapTree, while it is comparable to that of the other algorithms.
9
Du H, Liang C. Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads. Nat Commun 2019; 10:5360. [PMID: 31767853 PMCID: PMC6877557 DOI: 10.1038/s41467-019-13355-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 11/05/2018] [Accepted: 11/04/2019] [Indexed: 01/27/2023] Open
Abstract
The abundant repetitive sequences in complex eukaryotic genomes cause fragmented assemblies, which lose value as reference genomes, often due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes. Here we report a genome assembly method HERA, which resolves repeats efficiently by constructing a connection graph from an overlap graph. We test HERA on the genomes of rice, maize, human, and Tartary buckwheat with single-molecule sequencing and mapping data. HERA correctly assembles most of the previously unassembled regions, resulting in dramatically improved, highly contiguous genome assemblies with newly assembled gene sequences. For example, the maize contig N50 size reaches 61.2 Mb and the Tartary buckwheat genome comprises only 20 contigs. HERA can also be used to fill gaps and fix errors in reference genomes. The application of HERA will greatly improve the quality of new or existing assemblies of complex genomes.
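The contig N50 quoted in this abstract is a standard contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the calculation follows; the function name `n50` and the toy lengths are illustrative only and do not come from the paper.

```python
from typing import List


def n50(lengths: List[int]) -> int:
    """Return the contig length at which the running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of contigs")


# Toy assembly: six contigs totalling 290 units.
print(n50([80, 70, 50, 40, 30, 20]))
```

Because the statistic depends only on the sorted lengths, merging fragmented contigs across resolved repeats raises it, which is why it is often reported alongside contig counts.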
Affiliation(s)
- Huilong Du
  - State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, 1 Beichen West Road No. 2, Beijing, 100101, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Chengzhi Liang
  - State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, 1 Beichen West Road No. 2, Beijing, 100101, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
10
Haghshenas E, Sahinalp SC, Hach F. lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data. Bioinformatics 2019; 35:20-27. [PMID: 30561550 DOI: 10.1093/bioinformatics/bty544] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/16/2017] [Accepted: 06/28/2018] [Indexed: 02/01/2023] Open
Abstract
Motivation: Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity. Results: We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint. Availability and implementation: lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ehsan Haghshenas
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- S Cenk Sahinalp
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  - School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Faraz Hach
  - Vancouver Prostate Centre, Vancouver, BC, Canada
  - Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada
11
Edge P, Bansal V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat Commun 2019; 10:4660. [PMID: 31604920 PMCID: PMC6788989 DOI: 10.1038/s41467-019-12493-y] [Citation(s) in RCA: 120] [Impact Index Per Article: 24.0] [Received: 02/27/2019] [Accepted: 09/10/2019] [Indexed: 12/30/2022] Open
Abstract
Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.
Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads with a high error rate. Here, the authors develop Longshot, a computational method that detects and phases single-nucleotide variants (SNVs) in diploid genomes using SMS data.
Affiliation(s)
- Peter Edge
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, 92093, USA
| | - Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California, 92093, USA.
| |
|
12
|
Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun 2019. [PMID: 30992455 DOI: 10.1038/s41467-018-08148-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023] Open
Abstract
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
|
13
|
Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun 2019; 10:1784. [PMID: 30992455 PMCID: PMC6467913 DOI: 10.1038/s41467-018-08148-z] [Citation(s) in RCA: 489] [Impact Index Per Article: 97.8] [Received: 10/24/2018] [Accepted: 12/20/2018] [Indexed: 12/30/2022] Open
Abstract
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
|
14
|
Vollger MR, Dishuck PC, Sorensen M, Welch AE, Dang V, Dougherty ML, Graves-Lindsay TA, Wilson RK, Chaisson MJP, Eichler EE. Long-read sequence and assembly of segmental duplications. Nat Methods 2019; 16:88-94. [PMID: 30559433 PMCID: PMC6382464 DOI: 10.1038/s41592-018-0236-3] [Citation(s) in RCA: 81] [Impact Index Per Article: 16.2] [Received: 06/01/2018] [Accepted: 10/30/2018] [Indexed: 01/22/2023]
Abstract
We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
Affiliation(s)
- Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Philip C Dishuck
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Melanie Sorensen
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - AnneMarie E Welch
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Vy Dang
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Max L Dougherty
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tina A Graves-Lindsay
- The McDonnell Genome Institute at Washington University, Washington University School of Medicine, St. Louis, MO, USA
| | - Richard K Wilson
- Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | | | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
| |
|
15
|
Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 2018; 36:1174-1182. [PMID: 30346939 PMCID: PMC6476705 DOI: 10.1038/nbt.4277] [Citation(s) in RCA: 264] [Impact Index Per Article: 44.0] [Received: 02/25/2018] [Accepted: 09/10/2018] [Indexed: 12/20/2022]
Abstract
Complex allelic variation hampers the assembly of haplotype-resolved sequences from diploid genomes. We developed trio binning, an approach that simplifies haplotype assembly by resolving allelic variation before assembly. In contrast with prior approaches, the effectiveness of our method improved with increasing heterozygosity. Trio binning uses short reads from two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. We used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches. We sequenced an F1 cross between the cattle subspecies Bos taurus taurus and Bos taurus indicus and completely assembled both parental haplotypes with NG50 haplotig sizes of >20 Mb and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We suggest that trio binning improves diploid genome assembly and will facilitate new studies of haplotype variation and inheritance.
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Brian P. Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Alexander T. Dilthey
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
- Institute of Medical Microbiology, Heinrich-Heine-University Düsseldorf, Düsseldorf, North Rhine-Westphalia, Germany
| | - Derek M. Bickhart
- Cell Wall Biology and Utilization Laboratory, ARS USDA, Madison, Wisconsin, USA
| | | | - Stefan Hiendleder
- Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia
- Robinson Research Institute, The University of Adelaide, Adelaide SA, Australia
| | - John L. Williams
- Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia
| | | | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| |
|
16
|
Wilfert AB, Sulovari A, Turner TN, Coe BP, Eichler EE. Recurrent de novo mutations in neurodevelopmental disorders: properties and clinical implications. Genome Med 2017; 9:101. [PMID: 29179772 PMCID: PMC5704398 DOI: 10.1186/s13073-017-0498-x] [Citation(s) in RCA: 91] [Impact Index Per Article: 13.0] [Indexed: 12/21/2022] Open
Abstract
Next-generation sequencing (NGS) is now more accessible to clinicians and researchers. As a result, our understanding of the genetics of neurodevelopmental disorders (NDDs) has rapidly advanced over the past few years. NGS has led to the discovery of new NDD genes with an excess of recurrent de novo mutations (DNMs) when compared to controls. Development of large-scale databases of normal and disease variation has given rise to metrics exploring the relative tolerance of individual genes to human mutation. Genetic etiology and diagnosis rates have improved, which have led to the discovery of new pathways and tissue types relevant to NDDs. In this review, we highlight several key findings based on the discovery of recurrent DNMs ranging from copy number variants to point mutations. We explore biases and patterns of DNM enrichment and the role of mosaicism and secondary mutations in variable expressivity. We discuss the benefit of whole-genome sequencing (WGS) over whole-exome sequencing (WES) to understand more complex, multifactorial cases of NDD and explain how this improved understanding aids diagnosis and management of these disorders. Comprehensive assessment of the DNM landscape across the genome using WGS and other technologies will lead to the development of novel functional and bioinformatics approaches to interpret DNMs and drive new insights into NDD biology.
Affiliation(s)
- Amy B Wilfert
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - Arvis Sulovari
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - Tychele N Turner
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - Bradley P Coe
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, 98195, USA.
| |
|