1
Li M, Li LM. RegScaf: a regression approach to scaffolding. Bioinformatics 2022; 38:2675-2682. [PMID: 35561180 PMCID: PMC9326850 DOI: 10.1093/bioinformatics/btac174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Revised: 02/19/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Crucial to the correctness of a genome assembly is the accuracy of the underlying scaffolds that specify the orders and orientations of contigs together with the gap distances between contigs. Current methods construct scaffolds based on the alignments of 'linking' reads against contigs. We found that some 'optimal' alignments are mistaken due to factors such as the contig boundary effect, particularly in the presence of repeats. Occasionally, the incorrect alignments can even overwhelm the correct ones. Detecting the incorrect linking information is challenging for existing methods.
RESULTS: In this study, we present a novel scaffolding method, RegScaf. It first examines, by kernel density estimation, the distribution of distances between contigs derived from read alignments. When a density shows multiple modes, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode. The linear model parameterizes contigs by their positions on the genome; each linking distance between a pair of contigs is then taken as an observation on the difference of their positions. The parameters are estimated by minimizing a global loss function, which is a version of the trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances. The results on both synthetic and real datasets demonstrate that RegScaf outperforms some popular scaffolders, especially in the accuracy of gap estimates, by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well.
AVAILABILITY AND IMPLEMENTATION: RegScaf is publicly available at https://github.com/lemontealala/RegScaf.git.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
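The regression idea in this abstract can be illustrated with a minimal sketch. It is not RegScaf's implementation: the function names, the toy link data, and the brute-force subset search are all assumptions made for illustration, and the kernel density clustering step is omitted. Each link `(i, j, d)` is read as an observation `x[j] - x[i] ≈ d` on contig positions, contig 0 is pinned at position 0, and the least trimmed squares (LTS) fit minimizes the sum of the `h` smallest squared residuals.

```python
from itertools import combinations

def solve_least_squares(links, n):
    """Ordinary least squares for positions x[1..n-1], with x[0] fixed at 0.
    Returns None when the links do not determine every position."""
    m = n - 1
    ata = [[0.0] * m for _ in range(m)]   # normal-equation matrix A^T A
    atd = [0.0] * m                       # right-hand side A^T d
    for i, j, d in links:
        row = [0.0] * m                   # design row for x[j] - x[i]
        if i > 0:
            row[i - 1] = -1.0
        if j > 0:
            row[j - 1] = 1.0
        for a in range(m):
            atd[a] += row[a] * d
            for b in range(m):
                ata[a][b] += row[a] * row[b]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [ata[r] + [atd[r]] for r in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-9:
            return None                   # rank-deficient subset
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(m):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [0.0] + [aug[r][m] / aug[r][r] for r in range(m)]

def trimmed_loss(links, x, h):
    """Sum of the h smallest squared residuals over all links."""
    sq = sorted((x[j] - x[i] - d) ** 2 for i, j, d in links)
    return sum(sq[:h])

def lts_positions(links, n, h):
    """Brute-force LTS: fit every h-subset, keep the fit with the smallest
    trimmed loss. Rank-deficient subsets are skipped. Toy sizes only."""
    best_loss, best_x = None, None
    for subset in combinations(links, h):
        x = solve_least_squares(subset, n)
        if x is None:
            continue
        loss = trimmed_loss(links, x, h)
        if best_loss is None or loss < best_loss:
            best_loss, best_x = loss, x
    return best_x

# Hypothetical links among 3 contigs; the last one mimics a mistaken alignment.
links = [(0, 1, 1000.0), (0, 1, 1010.0), (1, 2, 1500.0),
         (1, 2, 1490.0), (0, 2, 2500.0), (0, 2, 4000.0)]
print("OLS:", solve_least_squares(links, 3))
print("LTS:", lts_positions(links, 3, h=5))
```

On this toy data the ordinary least squares fit is pulled to positions of about 1255 and 3000 by the single mistaken link, while the LTS fit with `h=5` stays at about 1005 and 2500. The exhaustive search grows combinatorially with the number of links, so a practical solver would need an approximate LTS algorithm instead.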
Affiliation(s)
- Mengtian Li
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Lei M Li
  - To whom correspondence should be addressed.
2
Nakamoto M, Uchino T, Koshimizu E, Kuchiishi Y, Sekiguchi R, Wang L, Sudo R, Endo M, Guiguen Y, Schartl M, Postlethwait JH, Sakamoto T. A Y-linked anti-Müllerian hormone type-II receptor is the sex-determining gene in ayu, Plecoglossus altivelis. PLoS Genet 2021; 17:e1009705. [PMID: 34437539 PMCID: PMC8389408 DOI: 10.1371/journal.pgen.1009705] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/20/2020] [Accepted: 07/09/2021] [Indexed: 11/19/2022] Open
Abstract
Whole-genome duplication and genome compaction are thought to have played important roles in teleost fish evolution. Ayu (or sweetfish), Plecoglossus altivelis, belongs to the superorder Stomiati, order Osmeriformes. Stomiati is phylogenetically classified as sister taxa of Neoteleostei. Thus, ayu holds an important position in the fish tree of life. Although ayu is economically important for the food industry and recreational fishing in Japan, few genomic resources are available for this species. To address this problem, we produced a draft genome sequence of ayu by whole-genome shotgun sequencing and constructed linkage maps using a genotyping-by-sequencing approach. Syntenic analyses of ayu and other teleost fish provided information about chromosomal rearrangements during the divergence of Stomiati, Protacanthopterygii and Neoteleostei. The size of the ayu genome indicates that genome compaction occurred after the divergence of the family Osmeridae. Ayu has an XX/XY sex-determination system for which we identified sex-associated loci by a genome-wide association study by genotyping-by-sequencing and whole-genome resequencing using wild populations. Genome-wide association mapping using wild ayu populations revealed three sex-linked scaffolds (total, 2.03 Mb). Comparison of whole-genome resequencing mapping coverage between males and females identified male-specific regions in sex-linked scaffolds. A duplicate copy of the anti-Müllerian hormone type-II receptor gene (amhr2bY) was found within these male-specific regions, distinct from the autosomal copy of amhr2. Expression of the Y-linked amhr2 gene was male-specific in sox9b-positive somatic cells surrounding germ cells in undifferentiated gonads, whereas autosomal amhr2 transcripts were detected in somatic cells in sexually undifferentiated gonads of both genetic males and females. Loss-of-function mutation for amhr2bY induced male to female sex reversal. 
Taken together with the known role of Amh and Amhr2 in sex differentiation, these results indicate that the paralog of amhr2 on the ayu Y chromosome determines genetic sex, and the male-specific amh-amhr2 pathway is critical for testicular differentiation in ayu.
Ayu (or sweetfish), Plecoglossus altivelis, is widely distributed in East Asia. Ayu belongs to the superorder Stomiati and the order Osmeriformes. Stomiati is phylogenetically classified as the sister group of Neoteleostei, the largest clade of bony fish, which includes medaka, tuna and cod. The divergence of Protacanthopterygii (salmon and pike) and the common ancestor of Stomiati and Neoteleostei is estimated to have occurred approximately 190 million years ago. Thus, ayu holds an important position in the fish tree of life. We sequenced the ayu genome and constructed linkage maps using a genotyping-by-sequencing approach. Comparative analyses of ayu, medaka and northern pike revealed chromosomal rearrangements in the ayu lineage after the divergence of ayu and northern pike. Association mapping revealed a duplicate copy of the anti-Müllerian hormone type-II receptor gene (amhr2bY) located within a male-specific region. Y-linked amhr2 expression was male-specific in supporting cells in undifferentiated gonads, whereas autosomal amhr2 transcripts were detected in somatic cells of sexually undifferentiated gonads in both sexes. Loss-of-function mutation for amhr2bY induced male-to-female sex reversal. Taken together, these results indicate that the paralog of amhr2 on the Y chromosome determines genetic sex. Our findings support the hypothesis that the male-specific amh-amhr2 pathway is critical for gonadal differentiation in ayu.
Affiliation(s)
- Masatoshi Nakamoto
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Tsubasa Uchino
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Eriko Koshimizu
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
  - Department of Human Genetics, Yokohama City University, Graduate School of Medicine, Yokohama, Japan
- Yudai Kuchiishi
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Ryota Sekiguchi
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Liu Wang
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Ryusuke Sudo
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Masato Endo
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
- Manfred Schartl
  - University of Wuerzburg, Developmental Biochemistry, Biocenter, Würzburg, Germany
  - The Xiphophorus Genetic Stock Center, Department of Chemistry and Biochemistry, Texas State University, San Marcos, Texas, United States of America
- John H. Postlethwait
  - Institute of Neuroscience, University of Oregon, Eugene, Oregon, United States of America
- Takashi Sakamoto
  - Department of Marine Biosciences, Tokyo University of Marine Science and Technology, Tokyo, Japan
  - * E-mail:
3
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmueller T, Sczyrba A, Dilthey A, Klawonn F, McHardy AC. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol 2021; 22:212. [PMID: 34281604 PMCID: PMC8287296 DOI: 10.1186/s13059-021-02426-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/03/2020] [Accepted: 06/29/2021] [Indexed: 01/03/2023] Open
Abstract
With viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
Affiliation(s)
- Adrian Fritz
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Andreas Bremges
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Zhi-Luo Deng
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till Robin Lesker
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Jasper Götting
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
  - Institute of Virology, Hannover Medical School, Hannover, Germany
- Tina Ganzenmueller
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
  - Institute of Virology, Hannover Medical School, Hannover, Germany
  - Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- Alexander Sczyrba
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Alexander Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Frank Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  - Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Alice Carolyn McHardy
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
4
Kronenberg ZN, Rhie A, Koren S, Concepcion GT, Peluso P, Munson KM, Porubsky D, Kuhn K, Mueller KA, Low WY, Hiendleder S, Fedrigo O, Liachko I, Hall RJ, Phillippy AM, Eichler EE, Williams JL, Smith TPL, Jarvis ED, Sullivan ST, Kingan SB. Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C. Nat Commun 2021; 12:1935. [PMID: 33911078 PMCID: PMC8081726 DOI: 10.1038/s41467-020-20536-y] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 05/13/2020] [Accepted: 11/12/2020] [Indexed: 01/27/2023] Open
Abstract
Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially-phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without parental data, and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97%, compared to 80-91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.
Affiliation(s)
- Zev N Kronenberg
  - Phase Genomics, Seattle, WA, USA
  - Pacific Biosciences, Menlo Park, CA, USA
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Kristen Kuhn
  - US Meat Animal Research Center, ARS USDA, Clay Center, NE, USA
- Wai Yee Low
  - Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy, SA, Australia
- Stefan Hiendleder
  - Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy, SA, Australia
- Olivier Fedrigo
  - Vertebrate Genomes Laboratory, The Rockefeller University, New York, NY, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- John L Williams
  - Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy, SA, Australia
  - Dipartimento di Scienze Animali, della Nutrizione e degli Alimenti, Università Cattolica del Sacro Cuore, 29122, Piacenza, Italy
- Erich D Jarvis
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
5
Deneke C, Brendebach H, Uelze L, Borowiak M, Malorny B, Tausch SH. Species-Specific Quality Control, Assembly and Contamination Detection in Microbial Isolate Sequences with AQUAMIS. Genes (Basel) 2021; 12:644. [PMID: 33926025 PMCID: PMC8145556 DOI: 10.3390/genes12050644] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 03/26/2021] [Revised: 04/23/2021] [Accepted: 04/24/2021] [Indexed: 01/13/2023] Open
Abstract
Sequencing of whole microbial genomes has become a standard procedure for cluster detection, source tracking, outbreak investigation and surveillance of many microorganisms. An increasing number of laboratories are currently in a transition phase from classical methods towards next generation sequencing, generating unprecedented amounts of data. Since the precision of downstream analyses depends significantly on the quality of raw data generated on the sequencing instrument, a comprehensive, meaningful primary quality control is indispensable. Here, we present AQUAMIS, a Snakemake workflow for an extensive quality control and assembly of raw Illumina sequencing data, allowing laboratories to automatize the initial analysis of their microbial whole-genome sequencing data. AQUAMIS performs all steps of primary sequence analysis, consisting of read trimming, read quality control (QC), taxonomic classification, de-novo assembly, reference identification, assembly QC and contamination detection, both on the read and assembly level. The results are visualized in an interactive HTML report including species-specific QC thresholds, allowing non-bioinformaticians to assess the quality of sequencing experiments at a glance. All results are also available as a standard-compliant JSON file, facilitating easy downstream analyses and data exchange. We have applied AQUAMIS to analyze ~13,000 microbial isolates as well as ~1000 in-silico contaminated datasets, proving the workflow's ability to perform in high throughput routine sequencing environments and reliably predict contaminations. We found that intergenus and intragenus contaminations can be detected most accurately using a combination of different QC metrics available within AQUAMIS.
Affiliation(s)
- Simon H. Tausch
  - Department Biological Safety, German Federal Institute for Risk Assessment, 10589 Berlin, Germany; (C.D.); (H.B.); (L.U.); (M.B.); (B.M.)
6
Abstract
In this article, we present QuASeR, a reference-free DNA sequence reconstruction implementation via de novo assembly on both gate-based and quantum annealing platforms. This is the first time this important application in bioinformatics is modeled using quantum computation. Each one of the four steps of the implementation (TSP, QUBO, Hamiltonians and QAOA) is explained with a proof-of-concept example to target both the genomics research community and quantum application developers in a self-contained manner. The implementation and results on executing the algorithm from a set of DNA reads to a reconstructed sequence, on a gate-based quantum simulator, the D-Wave quantum annealing simulator and hardware are detailed. We also highlight the limitations of current classical simulation and available quantum hardware systems. The implementation is open-source and can be found on https://github.com/QE-Lab/QuASeR.
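The first of the four steps named in this abstract, casting read assembly as a TSP-style ordering problem, can be sketched classically. The following is only a brute-force stand-in written for illustration: it is not the QuASeR code, it searches for a best Hamiltonian path rather than a closed tour, and it does not cover the QUBO, Hamiltonian, or QAOA steps. The function names and the three toy reads are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(reads):
    """Try every read order, keep the one with the largest total overlap,
    then merge consecutive reads along their overlaps."""
    best_order, best_score = None, -1
    for order in permutations(reads):
        score = sum(overlap(x, y) for x, y in zip(order, order[1:]))
        if score > best_score:
            best_order, best_score = order, score
    sequence = best_order[0]
    for prev, nxt in zip(best_order, best_order[1:]):
        sequence += nxt[overlap(prev, nxt):]   # append only the new bases
    return sequence

# Hypothetical error-free reads given in shuffled order.
reads = ["TGCAATC", "ATGGCGT", "GCGTGCA"]
print(assemble(reads))
```

The factorial number of orderings is what makes the problem hard at scale, which is the motivation for reformulating it for annealing or gate-based solvers as the abstract describes.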
Affiliation(s)
- Aritra Sarkar
  - Department of Quantum and Computer Engineering, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Zaid Al-Ars
  - Department of Quantum and Computer Engineering, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Koen Bertels
  - Department of Informatics Engineering, Faculty of Engineering, University of Porto, Porto, Portugal
7
Di Genova A, Buena-Atienza E, Ossowski S, Sagot MF. Efficient hybrid de novo assembly of human genomes with WENGAN. Nat Biotechnol 2021; 39:422-430. [PMID: 33318652 PMCID: PMC8041623 DOI: 10.1038/s41587-020-00747-w] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 12/04/2019] [Revised: 10/08/2020] [Accepted: 10/21/2020] [Indexed: 12/12/2022]
Abstract
Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24-80.64 Mb), few assembly errors (contig NGA50: 11.8-59.59 Mb), good consensus quality (QV: 27.84-42.88) and high gene completeness (BUSCO complete: 94.6-95.2%), while consuming low computational resources (CPU hours: 187-1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).
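The contiguity figures in this abstract are reported as NG50. As a reminder of how that statistic is commonly defined, here is a small sketch: NG50 is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of the estimated genome size, whereas N50 uses half of the total assembly length. The function names and the toy contig lengths are illustrative assumptions and are not taken from WENGAN.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of contigs, sorted
    longest first, reaches half of genome_size. Returns 0 if the assembly
    covers less than half of the genome."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return 0

def n50(contig_lengths):
    """N50 is the same statistic measured against the assembly's own size."""
    return ng50(contig_lengths, sum(contig_lengths))

# Hypothetical contig lengths (arbitrary units) and estimated genome size.
contigs = [80, 70, 50, 40, 30, 20]
print(ng50(contigs, genome_size=400), n50(contigs))
```

Because NG50 is measured against a fixed genome size, it allows assemblies of different total length to be compared on the same footing, which is presumably why the abstract compares it with the GRCh38 value.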
Affiliation(s)
- Alex Di Genova
  - Inria Grenoble Rhône-Alpes, Montbonnot, France
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, Villeurbanne, France
- Elena Buena-Atienza
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
  - NGS Competence Center Tübingen (NCCT), University of Tübingen, Tübingen, Germany
- Stephan Ossowski
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
  - NGS Competence Center Tübingen (NCCT), University of Tübingen, Tübingen, Germany
- Marie-France Sagot
  - Inria Grenoble Rhône-Alpes, Montbonnot, France
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, Villeurbanne, France
8
Collins JH, Keating KW, Jones TR, Balaji S, Marsan CB, Çomo M, Newlon ZJ, Mitchell T, Bartley B, Adler A, Roehner N, Young EM. Engineered yeast genomes accurately assembled from pure and mixed samples. Nat Commun 2021; 12:1485. [PMID: 33674578 PMCID: PMC7935868 DOI: 10.1038/s41467-021-21656-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/21/2020] [Accepted: 02/04/2021] [Indexed: 01/31/2023] Open
Abstract
Yeast whole genome sequencing (WGS) lacks end-to-end workflows that identify genetic engineering. Here we present Prymetime, a tool that assembles yeast plasmids and chromosomes and annotates genetic engineering sequences. It is a hybrid workflow: it uses short and long reads as inputs to perform separate linear and circular assembly steps. This structure is necessary to accurately resolve genetic engineering sequences in plasmids and the genome. We show this by assembling diverse engineered yeasts, in some cases revealing unintended deletions and integrations. Furthermore, the resulting whole genomes are high quality, although the underlying assembly software does not consistently resolve highly repetitive genome features. Finally, we assemble plasmids and genome integrations from metagenomic sequencing, even with 1 engineered cell in 1000. This work is a blueprint for building WGS workflows and establishes WGS-based identification of yeast genetic engineering.
Affiliation(s)
- Joseph H Collins
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Kevin W Keating
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Trent R Jones
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Shravani Balaji
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Celeste B Marsan
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Marina Çomo
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Zachary J Newlon
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Tom Mitchell
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Bryan Bartley
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Aaron Adler
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Nicholas Roehner
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Eric M Young
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
9
Schwengers O, Barth P, Falgenhauer L, Hain T, Chakraborty T, Goesmann A. Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microb Genom 2020; 6:mgen000398. [PMID: 32579097 PMCID: PMC7660248 DOI: 10.1099/mgen.0.000398] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 12/03/2019] [Accepted: 06/02/2020] [Indexed: 12/22/2022] Open
Abstract
Plasmids are extrachromosomal genetic elements that replicate independently of the chromosome and play a vital role in the environmental adaptation of bacteria. Due to potential mobilization or conjugation capabilities, plasmids are important genetic vehicles for antimicrobial resistance genes and virulence factors with huge and increasing clinical implications. They are therefore subject to large genomic studies within the scientific community worldwide. As a result of rapidly improving next-generation sequencing methods, the quantity of sequenced bacterial genomes is constantly increasing, in turn raising the need for specialized tools to (i) extract plasmid sequences from draft assemblies, (ii) derive their origin and distribution, and (iii) further investigate their genetic repertoire. Recently, several bioinformatic methods and tools have emerged to tackle this issue; however, a combination of high sensitivity and specificity in plasmid sequence identification is rarely achieved in a taxon-independent manner. In addition, many software tools are not appropriate for large high-throughput analyses or cannot be included in existing software pipelines due to their technical design or software implementation. In this study, we investigated differences in the replicon distributions of protein-coding genes on a large scale as a new approach to distinguish plasmid-borne from chromosome-borne contigs. We defined and computed statistical discrimination thresholds for a new metric: the replicon distribution score (RDS), which achieved an accuracy of 96.6 %. The final performance was further improved by the combination of the RDS metric with heuristics exploiting several plasmid-specific higher-level contig characterizations. We implemented this workflow in a new high-throughput taxon-independent bioinformatics software tool called Platon for the recruitment and characterization of plasmid-borne contigs from short-read draft assemblies. 
Compared to PlasFlow, Platon achieved a higher accuracy (97.5 %) and more balanced predictions (F1=82.6 %) tested on a broad range of bacterial taxa and better or equal performance against the targeted tools PlasmidFinder and PlaScope on sequenced Escherichia coli isolates. Platon is available at: http://platon.computational.bio/.
Affiliation(s)
- Oliver Schwengers
  - Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, Germany
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, Germany
  - German Center for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany
- Patrick Barth
  - Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, Germany
- Linda Falgenhauer
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, Germany
  - German Center for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany
  - Present address: Institute of Hygiene and Environmental Health, Justus Liebig University, Giessen, Germany
- Torsten Hain
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, Germany
  - German Center for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany
- Trinad Chakraborty
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, Germany
  - German Center for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany
- Alexander Goesmann
  - Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, Germany
  - German Center for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany
10
Alonge M, Shumate A, Puiu D, Zimin AV, Salzberg SL. Chromosome-Scale Assembly of the Bread Wheat Genome Reveals Thousands of Additional Gene Copies. Genetics 2020; 216:599-608. [PMID: 32796007 PMCID: PMC7536849 DOI: 10.1534/genetics.120.303501] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 07/04/2020] [Accepted: 08/10/2020] [Indexed: 11/18/2022] Open
Abstract
Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications, including at the Ppd-B1 photoperiod response locus.
Affiliation(s)
- Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
| | - Steven L Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21205
|
11
|
Maggi J, Roberts L, Koller S, Rebello G, Berger W, Ramesar R. De Novo Assembly-Based Analysis of RPGR Exon ORF15 in an Indigenous African Cohort Overcomes Limitations of a Standard Next-Generation Sequencing (NGS) Data Analysis Pipeline. Genes (Basel) 2020; 11:genes11070800. [PMID: 32679846 PMCID: PMC7396994 DOI: 10.3390/genes11070800] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/11/2020] [Revised: 06/24/2020] [Accepted: 07/13/2020] [Indexed: 01/10/2023]
Abstract
RPGR exon ORF15 variants are one of the most frequent causes for inherited retinal disorders (IRDs), in particular retinitis pigmentosa. The low sequence complexity of this mutation hotspot makes it prone to indels and challenging for sequence data analysis. Whole-exome sequencing generally fails to provide adequate coverage in this region. Therefore, complementary methods are needed to avoid false positives as well as negative results. In this study, next-generation sequencing (NGS) was used to sequence long-range PCR amplicons for an IRD cohort of African ancestry. By developing a novel secondary analysis pipeline based on de novo assembly, we were able to avoid the miscalling of variants generated by standard NGS analysis tools. We identified pathogenic variants in 11 patients (13% of the cohort), two of which have not been reported previously. We provide a novel and alternative end-to-end secondary analysis pipeline for targeted NGS of ORF15 that is less prone to false positive and negative variant calls.
Affiliation(s)
- Jordi Maggi
- Institute of Medical Molecular Genetics, University of Zurich, 8952 Schlieren, Switzerland
- Zurich Center for Integrative Human Physiology (ZIHP), University of Zurich, 8006 Zurich, Switzerland
| | - Lisa Roberts
- University of Cape Town/MRC Genomic and Precision Medicine Research Unit, Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine (IDM), Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
| | - Samuel Koller
- Institute of Medical Molecular Genetics, University of Zurich, 8952 Schlieren, Switzerland
| | - George Rebello
- University of Cape Town/MRC Genomic and Precision Medicine Research Unit, Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine (IDM), Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
| | - Wolfgang Berger
- Institute of Medical Molecular Genetics, University of Zurich, 8952 Schlieren, Switzerland
- Zurich Center for Integrative Human Physiology (ZIHP), University of Zurich, 8006 Zurich, Switzerland
- Neuroscience Center Zurich (ZNZ), University and ETH Zurich, 8006 Zurich, Switzerland
| | - Rajkumar Ramesar
- University of Cape Town/MRC Genomic and Precision Medicine Research Unit, Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine (IDM), Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
|
12
|
Klein J, Neilen M, van Verk M, Dutilh BE, Van den Ackerveken G. Genome reconstruction of the non-culturable spinach downy mildew Peronospora effusa by metagenome filtering. PLoS One 2020; 15:e0225808. [PMID: 32396560 PMCID: PMC7217449 DOI: 10.1371/journal.pone.0225808] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/12/2019] [Accepted: 04/24/2020] [Indexed: 01/27/2023]
Abstract
Peronospora effusa (previously known as P. farinosa f. sp. spinaciae, and here referred to as Pfs) is an obligate biotrophic oomycete that causes downy mildew on spinach (Spinacia oleracea). To combat this destructive disease, many resistant cultivars have been bred and used. However, new Pfs races rapidly break the employed resistance genes. To get insight into the gene repertoire of Pfs and identify infection-related genes, the genome of the first reference race, Pfs1, was sequenced, assembled, and annotated. Due to the obligate biotrophic nature of this pathogen, material for DNA isolation can only be collected from infected spinach leaves that, however, also contain many other microorganisms. The obtained sequences can, therefore, be considered a metagenome. To filter and obtain Pfs sequences, we utilized the CAT tool to taxonomically annotate ORFs residing on long sequences of a genome pre-assembly. This study is the first to show that CAT filtering performs well on eukaryotic contigs. Based on the taxonomy, determined on multiple ORFs, contaminating long sequences and corresponding reads were removed from the metagenome. Filtered reads were re-assembled to provide a clean and improved Pfs genome sequence of 32.4 Mbp consisting of 8,635 scaffolds. Transcript sequencing of a range of infection time points aided the prediction of a total of 13,277 gene models, including 99 RxLR(-like) effector and 14 putative Crinkler genes. Comparative analysis identified common features in the predicted secretomes of different obligate biotrophic oomycetes, regardless of their phylogenetic distance. Their secretomes are generally smaller than those of hemi-biotrophic and necrotrophic oomycete species. We observe a reduction in proteins involved in cell wall degradation, in Nep1-like proteins (NLPs), proteins with PAN/apple domains, and host translocated effectors.
The genome of Pfs1 will be instrumental in studying downy mildew virulence and for understanding the molecular adaptations by which new isolates break spinach resistance.
Affiliation(s)
- Joël Klein
- Department of Biology, Plant-Microbe Interactions, Utrecht University, Utrecht, The Netherlands
| | - Manon Neilen
- Department of Biology, Plant-Microbe Interactions, Utrecht University, Utrecht, The Netherlands
| | - Marcel van Verk
- Department of Biology, Plant-Microbe Interactions, Utrecht University, Utrecht, The Netherlands
- Crop Data Science, KeyGene, Wageningen, The Netherlands
| | - Bas E. Dutilh
- Department of Biology, Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, The Netherlands
| | - Guido Van den Ackerveken
- Department of Biology, Plant-Microbe Interactions, Utrecht University, Utrecht, The Netherlands
|
13
|
Sadat-Hosseini M, Bakhtiarizadeh MR, Boroomand N, Tohidfar M, Vahdati K. Combining independent de novo assemblies to optimize leaf transcriptome of Persian walnut. PLoS One 2020; 15:e0232005. [PMID: 32343733 PMCID: PMC7188282 DOI: 10.1371/journal.pone.0232005] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 05/27/2019] [Accepted: 04/06/2020] [Indexed: 12/22/2022]
Abstract
Transcriptome resources can help increase the yield and quality of walnuts. Finding the best transcriptome assembly has not yet been a subject of walnut research. This research generated 240,179,782 reads from cDNA libraries of 11 walnut leaves. The reads provided a complete de novo transcriptome assembly. Fifteen different transcriptome assemblies were constructed using five well-known assemblers (Bridger, BinPacker, SOAPdenovo-Trans, Trinity and SPAdes) with different k-mer lengths, as well as two merging approaches (EvidentialGene and Transfuse). Based on four assembly quality metrics, the results indicated that merging the assemblies generated by the de novo assemblers was effective. Finally, EvidentialGene was recognized as the best approach for the de novo assembly of the leaf transcriptome in walnut. Of the 183,191 transcripts generated by EvidentialGene, 109,413 had protein-coding potential (59.72%) and 104,926 were recognized as ORFs (57.27%). In addition, 79,185 transcripts were predicted to have at least one hit to the Pfam database. A total of 3,931 transcription factors were identified by BLAST searching against PlnTFDB. Furthermore, 6,591 of the predicted peptide sequences contained signaling peptides, while 92,704 contained transmembrane domains. Comparison of the assembled transcripts with transcripts of the walnut and the published genome assembly for the 'Chandler' cultivar using the BLAST algorithm identified 27,304 and 19,178 homologous transcripts, respectively. De novo transcriptomes of walnut leaves can be developed for future studies in functional genomics and genetics of walnuts.
Affiliation(s)
- Mohammad Sadat-Hosseini
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
- Department of Horticulture, Faculty of Agriculture, University of Jiroft, Jiroft, Iran
| | | | - Naser Boroomand
- Department of Soil Science, Faculty of Agriculture, Shahid Bahonar University of Kerman, Kerman, Iran
| | - Masoud Tohidfar
- Department of Plant Biotechnology, Faculty of Life Science and Biotechnology, Shahid Beheshti University, Tehran, Iran
| | - Kourosh Vahdati
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
|
14
|
Jayakumar V, Ishii H, Seki M, Kumita W, Inoue T, Hase S, Sato K, Okano H, Sasaki E, Sakakibara Y. An improved de novo genome assembly of the common marmoset genome yields improved contiguity and increased mapping rates of sequence data. BMC Genomics 2020; 21:243. [PMID: 32241258 PMCID: PMC7114785 DOI: 10.1186/s12864-020-6657-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/05/2020] [Accepted: 03/09/2020] [Indexed: 12/31/2022]
Abstract
BACKGROUND The common marmoset (Callithrix jacchus) is one of the most studied primate model organisms. However, the marmoset genomes available in the public databases are highly fragmented and filled with sequence gaps, hindering research advances related to marmoset genomics and transcriptomics. RESULTS Here we utilize single-molecule, long-read sequence data to improve and update the existing genome assembly and report a near-complete genome of the common marmoset. The assembly is 2.79 Gb in size, with a contig N50 length of 6.37 Mb and a chromosomal scaffold N50 length of 143.91 Mb, representing the most contiguous and high-quality marmoset genome to date. Approximately 90% of the assembled genome was represented in contigs longer than 1 Mb, with approximately 104-fold improvement in contiguity over the previously published marmoset genome. More than 98% of the gaps from the previously published genomes were filled successfully, which improved the mapping rates of genomic and transcriptomic data onto the assembled genome. CONCLUSIONS Altogether, the updated, high-quality common marmoset genome assembly provides improvements at various levels over the previous versions of the marmoset genome assemblies. This will allow researchers working on primate genomics to apply the genome more efficiently for their genomic and transcriptomic sequence data.
Affiliation(s)
- Vasanthan Jayakumar
- Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522 Japan
| | - Hiromi Ishii
- Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522 Japan
| | - Misato Seki
- Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522 Japan
| | - Wakako Kumita
- Department of Marmoset Biology and Medicine, Central Institute for Experimental Animals, Kawasaki, Kanagawa 210-0821 Japan
| | - Takashi Inoue
- Department of Marmoset Biology and Medicine, Central Institute for Experimental Animals, Kawasaki, Kanagawa 210-0821 Japan
| | - Sumitaka Hase
- Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522 Japan
| | - Kengo Sato
- Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522 Japan
| | - Hideyuki Okano
- Department of Physiology, Keio University School of Medicine, Shinjuku, Tokyo, 160-8582 Japan
- Laboratory for Marmoset Neural Architecture, RIKEN Center for Brain Science, Wako-shi, Saitama, 351-0198 Japan
| | - Erika Sasaki
- Department of Marmoset Biology and Medicine, Central Institute for Experimental Animals, Kawasaki, Kanagawa 210-0821 Japan
| | - Yasubumi Sakakibara
- Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522 Japan
|
15
|
Linsmith G, Rombauts S, Montanari S, Deng CH, Celton JM, Guérif P, Liu C, Lohaus R, Zurn JD, Cestaro A, Bassil NV, Bakker LV, Schijlen E, Gardiner SE, Lespinasse Y, Durel CE, Velasco R, Neale DB, Chagné D, Van de Peer Y, Troggio M, Bianco L. Pseudo-chromosome-length genome assembly of a double haploid "Bartlett" pear (Pyrus communis L.). Gigascience 2019; 8:giz138. [PMID: 31816089 PMCID: PMC6901071 DOI: 10.1093/gigascience/giz138] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 07/16/2019] [Revised: 10/18/2019] [Accepted: 10/30/2019] [Indexed: 11/14/2022]
Abstract
BACKGROUND We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII long-read sequencing, Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. The sample selected for sequencing is a double haploid derived from the same "Bartlett" reference pear that was previously sequenced. Sequencing of di-haploid plants makes assembly more tractable in highly heterozygous species such as P. communis. FINDINGS A total of 496.9 Mb corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 pear chromosomes. Approximately 50% (247 Mb) of the genome consists of repetitive sequences. Gene annotation confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted. CONCLUSIONS We showed that the use of a doubled-haploid plant is an effective solution to the problems presented by high levels of heterozygosity and duplication for the generation of high-quality genome assemblies. We present a high-quality chromosome-scale assembly of the European pear Pyrus communis and demonstrate its high degree of synteny with the genomes of Malus x Domestica and Pyrus x bretschneideri.
Affiliation(s)
- Gareth Linsmith
- Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
- Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
| | - Stephane Rombauts
- Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
| | - Sara Montanari
- University of California Davis, Department of Plant Sciences, One Shields Ave, Davis, CA 95616, USA
| | - Cecilia H Deng
- The New Zealand Institute for Plant & Food Research Limited (PFR), Mt Albert Research Centre,120 Mt Albert Road, Sandringham, Auckland, 1025, New Zealand
| | - Jean-Marc Celton
- IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
| | - Philippe Guérif
- IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
| | - Chang Liu
- ZMBP, Allgemeine Genetik, Universität Tübingen, Auf der Morgenstelle 32, D-72076 Tübingen, Germany
| | - Rolf Lohaus
- Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
| | - Jason D Zurn
- USDA-ARS National Clonal Germplasm Repository, 33447 Peoria Road, Corvallis, OR 97333, USA
| | - Alessandro Cestaro
- Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
| | - Nahla V Bassil
- USDA-ARS National Clonal Germplasm Repository, 33447 Peoria Road, Corvallis, OR 97333, USA
| | - Linda V Bakker
- Wageningen UR – Bioscience P.O. Box 16, 6700AA, Wageningen, The Netherlands
| | - Elio Schijlen
- Wageningen UR – Bioscience P.O. Box 16, 6700AA, Wageningen, The Netherlands
| | - Susan E Gardiner
- The New Zealand Institute for Plant & Food Research Limited (PFR), Palmerston North Research Centre, Palmerston North, New Zealand
| | - Yves Lespinasse
- IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
| | - Charles-Eric Durel
- IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
| | - Riccardo Velasco
- CREA Research Centre for Viticulture and Enology, Via XXVIII Aprile 26, 31015 Conegliano (TV), Italy
| | - David B Neale
- University of California Davis, Department of Plant Sciences, One Shields Ave, Davis, CA 95616, USA
| | - David Chagné
- The New Zealand Institute for Plant & Food Research Limited (PFR), Palmerston North Research Centre, Palmerston North, New Zealand
| | - Yves Van de Peer
- Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
- Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Roper street, Pretoria 0028, South Africa
| | - Michela Troggio
- Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
| | - Luca Bianco
- Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
|
16
|
Souza GM, Van Sluys MA, Lembke CG, Lee H, Margarido GRA, Hotta CT, Gaiarsa JW, Diniz AL, Oliveira MDM, Ferreira SDS, Nishiyama MY, ten-Caten F, Ragagnin GT, Andrade PDM, de Souza RF, Nicastro GG, Pandya R, Kim C, Guo H, Durham AM, Carneiro MS, Zhang J, Zhang X, Zhang Q, Ming R, Schatz MC, Davidson B, Paterson AH, Heckerman D. Assembly of the 373k gene space of the polyploid sugarcane genome reveals reservoirs of functional diversity in the world's leading biomass crop. Gigascience 2019; 8:giz129. [PMID: 31782791 PMCID: PMC6884061 DOI: 10.1093/gigascience/giz129] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 01/25/2019] [Revised: 05/23/2019] [Accepted: 10/08/2019] [Indexed: 11/29/2022]
Abstract
BACKGROUND Sugarcane cultivars are polyploid interspecific hybrids of giant genomes, typically with 10-13 sets of chromosomes from 2 Saccharum species. The ploidy, hybridity, and size of the genome, estimated to have >10 Gb, pose a challenge for sequencing. RESULTS Here we present a gene space assembly of SP80-3280, including 373,869 putative genes and their potential regulatory regions. The alignment of single-copy genes in diploid grasses to the putative genes indicates that we could resolve 2-6 (up to 15) putative homo(eo)logs that are 99.1% identical within their coding sequences. Dissimilarities increase in their regulatory regions, and gene promoter analysis shows differences in regulatory elements within gene families that are expressed in a species-specific manner. We exemplify these differences for sucrose synthase (SuSy) and phenylalanine ammonia-lyase (PAL), 2 gene families central to carbon partitioning. SP80-3280 has particular regulatory elements involved in sucrose synthesis not found in the ancestor Saccharum spontaneum. PAL regulatory elements are found in co-expressed genes related to fiber synthesis within gene networks defined during plant growth and maturation. Comparison with sorghum reveals predominantly bi-allelic variations in sugarcane, consistent with the formation of 2 "subgenomes" after their divergence ∼3.8-4.6 million years ago and reveals single-nucleotide variants that may underlie their differences. CONCLUSIONS This assembly represents a large step towards a whole-genome assembly of a commercial sugarcane cultivar. It includes a rich diversity of genes and homo(eo)logous resolution for a representative fraction of the gene space, relevant to improve biomass and food production.
Affiliation(s)
- Glaucia Mendes Souza
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Marie-Anne Van Sluys
- Departamento de Botânica, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 277, São Paulo, SP 05508-090, Brazil
| | - Carolina Gimiliani Lembke
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Hayan Lee
- Cold Spring Harbor Laboratory, One Bungtown Road, Koch Building #1119, Cold Spring Harbor, NY 11724, United States of America
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, United States of America
| | - Gabriel Rodrigues Alves Margarido
- Departamento de Genética, Escola Superior de Agricultura Luiz de Queiroz, Universidade de São Paulo, Avenida Pádua Dias, 11, Piracicaba, SP 13418-900, Brazil
| | - Carlos Takeshi Hotta
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Jonas Weissmann Gaiarsa
- Departamento de Botânica, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 277, São Paulo, SP 05508-090, Brazil
| | - Augusto Lima Diniz
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Mauro de Medeiros Oliveira
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Sávio de Siqueira Ferreira
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
- Departamento de Botânica, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 277, São Paulo, SP 05508-090, Brazil
| | - Milton Yutaka Nishiyama
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
- Laboratório Especial de Toxinologia Aplicada, Instituto Butantan, Av. Vital Brasil, 1500, São Paulo, SP 05503-900, Brazil
| | - Felipe ten-Caten
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Geovani Tolfo Ragagnin
- Departamento de Botânica, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 277, São Paulo, SP 05508-090, Brazil
| | - Pablo de Morais Andrade
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes, 748, São Paulo, SP 05508-000, Brazil
| | - Robson Francisco de Souza
- Departamento de Microbiologia, Instituto de Ciências Biomédicas, Universidade de São Paulo, Av. Professor Lineu Prestes, 1734, São Paulo, SP 05508-900, Brazil
| | - Gianlucca Gonçalves Nicastro
- Departamento de Microbiologia, Instituto de Ciências Biomédicas, Universidade de São Paulo, Av. Professor Lineu Prestes, 1734, São Paulo, SP 05508-900, Brazil
| | - Ravi Pandya
- Microsoft Research, One Microsoft Way, Redmond, WA 98052, United States of America
| | - Changsoo Kim
- Plant Genome Mapping Laboratory, University of Georgia, 120 Green Street, Athens, GA 30602-7223, United States of America
- Department of Crop Science, Chungnam National University, 99 Daehak Ro Yuseong Gu, Daejeon, 34134, South Korea
| | - Hui Guo
- Plant Genome Mapping Laboratory, University of Georgia, 120 Green Street, Athens, GA 30602-7223, United States of America
| | - Alan Mitchell Durham
- Departamento de Ciências da Computação, Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão, 1010, São Paulo, SP 05508-090, Brazil
| | - Monalisa Sampaio Carneiro
- Departamento de Biotecnologia e Produção Vegetal e Animal, Centro de Ciências Agrárias, Universidade Federal de São Carlos, Rodovia Washington Luis km 235, Araras, SP 13.565-905, Brazil
| | - Jisen Zhang
- FAFU and UIUC-SIB Joint Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, Shangxiadian Road, Fuzhou 350002, Fujian, China
| | - Xingtan Zhang
- FAFU and UIUC-SIB Joint Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, Shangxiadian Road, Fuzhou 350002, Fujian, China
| | - Qing Zhang
- FAFU and UIUC-SIB Joint Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, Shangxiadian Road, Fuzhou 350002, Fujian, China
| | - Ray Ming
- FAFU and UIUC-SIB Joint Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, Shangxiadian Road, Fuzhou 350002, Fujian, China
- Department of Plant Biology, University of Illinois at Urbana-Champaign, 201 W. Gregory Dr. Urbana, Urbana, Illinois 61801, United States of America
| | - Michael C Schatz
- Cold Spring Harbor Laboratory, One Bungtown Road, Koch Building #1119, Cold Spring Harbor, NY 11724, United States of America
- Departments of Computer Science and Biology, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218-2608, United States of America
| | - Bob Davidson
- Microsoft Research, One Microsoft Way, Redmond, WA 98052, United States of America
| | - Andrew H Paterson
- Plant Genome Mapping Laboratory, University of Georgia, 120 Green Street, Athens, GA 30602-7223, United States of America
| | - David Heckerman
- Microsoft Research, One Microsoft Way, Redmond, WA 98052, United States of America
|
17
|
Kim HS, Jeon S, Kim C, Kim YK, Cho YS, Kim J, Blazyte A, Manica A, Lee S, Bhak J. Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information. Gigascience 2019; 8:giz125. [PMID: 31794015 PMCID: PMC6889754 DOI: 10.1093/gigascience/giz125] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 06/27/2019] [Revised: 09/02/2019] [Accepted: 09/28/2019] [Indexed: 01/09/2023]
Abstract
BACKGROUND Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. FINDINGS We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. CONCLUSION The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
Affiliation(s)
- Hui-Su Kim
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
| | - Sungwon Jeon
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, UNIST-gil 50, Eonyang-eup, Ulju-gun, UNIST, Ulsan 44919, Republic of Korea
| | - Changjae Kim
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
| | - Yeon Kyung Kim
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
| | - Yun Sung Cho
- Clinomics Inc., UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
| | - Jungeun Kim
- Personal Genomics Institute, Genome Research Foundation, Osong saengmyong1ro, Cheongju 28160, Republic of Korea
| | - Asta Blazyte
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
| | - Andrea Manica
- Department of Zoology, Cambridge University, Downing street, Cambridge CB2 3EJ, UK
| | - Semin Lee
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, UNIST-gil 50, Eonyang-eup, Ulju-gun, UNIST, Ulsan 44919, Republic of Korea
| | - Jong Bhak
- KOGIC, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, UNIST-gil 50, Eonyang-eup, Ulju-gun, UNIST, Ulsan 44919, Republic of Korea
- Clinomics Inc., UNIST-gil 50, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
- Personal Genomics Institute, Genome Research Foundation, Osong saengmyong1ro, Cheongju 28160, Republic of Korea
|
18
|
Abstract
De novo genome assembly describes the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from the genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding (both in terms of memory and time). One of the major computational steps in modern-day short-read assemblers involves the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short-read assemblers build the complete de Bruijn graph for the set of input reads, and subsequently traverse and prune low-quality edges, in order to generate genomic "contigs", the output of assembly. These steps of graph construction and traversal contribute to well over 90 percent of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses Count-Min sketch, which is a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information pertaining only to a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead, the fraction of edges that contributes to contig generation is detected on the fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint) whilst possibly compromising on the output assembly quality. We present two main versions of the assembler: one that generates an assembly where each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly where the contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code. Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans, and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality trade-offs, when compared against many state-of-the-art genome assemblers.
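To make the Count-Min sketch mentioned in this abstract concrete, here is a minimal, generic sketch of the data structure applied to k-mer counting. It is not FastEtch's implementation: the class name, the `width` and `depth` defaults, the salted BLAKE2b hashing, and the toy reads are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Generic Count-Min sketch: approximate counts in fixed memory.

    Estimates never fall below the true count; hash collisions can only
    inflate them.
    """

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, derived by salting BLAKE2b
        # with the row index (an illustrative choice of hash family).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # The minimum over rows is the least collision-inflated counter.
        return min(self.table[row][col] for row, col in self._cells(item))


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


# Hypothetical toy reads streamed through the sketch one k-mer at a time.
sketch = CountMinSketch()
for read in ["ACGTACGT", "CGTACGTA"]:
    for kmer in kmers(read, 4):
        sketch.add(kmer)

print(sketch.estimate("ACGT"))  # at least 3, the true count in these reads
```

The memory footprint is `width * depth` counters regardless of how many distinct k-mers are streamed, which is the trade-off the abstract describes: lower memory in exchange for approximate counts.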
|
19
|
Hölzer M, Marz M. De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. Gigascience 2019; 8:giz039. [PMID: 31077315 PMCID: PMC6511074 DOI: 10.1093/gigascience/giz039] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Received: 08/14/2018] [Revised: 12/21/2018] [Accepted: 03/09/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND In recent years, massively parallel complementary DNA sequencing (RNA sequencing [RNA-Seq]) has emerged as a fast, cost-effective, and robust technology to study entire transcriptomes in various manners. In particular, for non-model organisms and in the absence of an appropriate reference genome, RNA-Seq is used to reconstruct the transcriptome de novo. Although the de novo transcriptome assembly of non-model organisms has been on the rise recently and new tools are frequently being developed, there is still a knowledge gap about which assembly software should be used to build a comprehensive de novo assembly. RESULTS Here, we present a large-scale comparative study in which 10 de novo assembly tools are applied to 9 RNA-Seq data sets spanning different kingdoms of life. Overall, we built >200 single assemblies and evaluated their performance on a combination of 20 biological-based and reference-free metrics. Our study is accompanied by a comprehensive and extensible Electronic Supplement that summarizes all data sets, assembly execution instructions, and evaluation results. Trinity, SPAdes, and Trans-ABySS, followed by Bridger and SOAPdenovo-Trans, generally outperformed the other tools compared. Moreover, we observed species-specific differences in the performance of each assembler. No tool delivered the best results for all data sets. CONCLUSIONS We recommend a careful choice and normalization of evaluation metrics to select the best assembling results as a critical step in the reconstruction of a comprehensive de novo transcriptome assembly.
Affiliation(s)
- Martin Hölzer
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany
| | - Manja Marz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany
- FLI Leibniz Institute for Age Research, Beutenbergstraße 11, 07743 Jena, Germany
|
20
|
Kingan SB, Heaton H, Cudini J, Lambert CC, Baybayan P, Galvin BD, Durbin R, Korlach J, Lawniczak MKN. A High-Quality De novo Genome Assembly from a Single Mosquito Using PacBio Sequencing. Genes (Basel) 2019; 10:E62. [PMID: 30669388 PMCID: PMC6357164 DOI: 10.3390/genes10010062] [Citation(s) in RCA: 79] [Impact Index Per Article: 15.8] [Received: 12/18/2018] [Revised: 01/14/2019] [Accepted: 01/15/2019] [Indexed: 12/15/2022]
Abstract
A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives; however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. Because the sequenced and assembled material came from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
Affiliation(s)
- Sarah B Kingan
- Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025, USA.
| | - Haynes Heaton
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.
| | - Juliana Cudini
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.
| | | | - Primo Baybayan
- Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025, USA.
| | - Brendan D Galvin
- Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025, USA.
| | - Richard Durbin
- Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK.
| | - Jonas Korlach
- Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025, USA.
| | - Mara K N Lawniczak
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.
|
21
|
Xu GC, Xu TJ, Zhu R, Zhang Y, Li SQ, Wang HW, Li JT. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. Gigascience 2019. [PMID: 30576505 DOI: 10.5524/100540] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 05/13/2023]
Abstract
BACKGROUND Completing a genome is an important goal of genome assembly. However, many assemblies, including reference assemblies, are unfinished and have a number of gaps. Long reads obtained from third-generation sequencing (TGS) platforms can help close these gaps and improve assembly contiguity. However, current gap-closure approaches using long reads require extensive runtime and high memory usage. Thus, a fast and memory-efficient approach using long reads is needed to obtain complete genomes. FINDINGS We developed LR_Gapcloser to rapidly and efficiently close the gaps in genome assembly. This tool utilizes long reads generated from TGS platforms. Tested on de novo assembled gaps, repeat-derived gaps, and real gaps, LR_Gapcloser closed a higher number of gaps faster and with a lower error rate and a much lower memory usage than two existing, state-of-the-art tools. This tool filled more gaps with raw reads than with error-corrected reads. It is applicable to gaps in assemblies produced by different approaches and from large and complex genomes. After performing gap-closure using this tool, the contig N50 size of the human CHM1 genome was improved from 143 kb to 19 Mb, a 132-fold increase. We also closed the gaps in the Triticum urartu genome, a large genome rich in repeats; the contig N50 size was increased by 40%. Further, we evaluated the contiguity and correctness of six hybrid assembly strategies by combining the optimal TGS-based and next-generation sequencing-based assemblers with LR_Gapcloser. A proposed and optimal hybrid strategy generated a new human CHM1 genome assembly with marked contiguity. The contig N50 value was greater than 28 Mb, which is larger than previous non-reference assemblies of the diploid human genome. CONCLUSIONS LR_Gapcloser is a fast and efficient tool that can be used to close gaps and improve the contiguity of genome assemblies. A proposed hybrid assembly including this tool promises reference-grade assemblies. The software is available at http://www.fishbrowser.org/software/LR_Gapcloser/.
Affiliation(s)
- Gui-Cai Xu
- Key Laboratory of Aquatic Genomics, Ministry of Agriculture and Rural Affairs, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, 150 Yongding Road, Beijing, 100141, China
- College of Marine Science, Zhejiang Ocean University, 1 Haida South Road, Zhoushan, 316022, China
| | - Tian-Jun Xu
- College of Marine Science, Zhejiang Ocean University, 1 Haida South Road, Zhoushan, 316022, China
| | - Rui Zhu
- Key Laboratory of Aquatic Genomics, Ministry of Agriculture and Rural Affairs, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, 150 Yongding Road, Beijing, 100141, China
- College of Fisheries and Life Science, Shanghai Ocean University, 999 Huchenghuan Road, Shanghai, 201306, China
| | - Yan Zhang
- Key Laboratory of Aquatic Genomics, Ministry of Agriculture and Rural Affairs, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, 150 Yongding Road, Beijing, 100141, China
| | - Shang-Qi Li
- Key Laboratory of Aquatic Genomics, Ministry of Agriculture and Rural Affairs, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, 150 Yongding Road, Beijing, 100141, China
| | - Hong-Wei Wang
- Key Laboratory of Aquatic Genomics, Ministry of Agriculture and Rural Affairs, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, 150 Yongding Road, Beijing, 100141, China
| | - Jiong-Tang Li
- Key Laboratory of Aquatic Genomics, Ministry of Agriculture and Rural Affairs, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, 150 Yongding Road, Beijing, 100141, China
|
22
|
|
23
|
Kolmogorov M, Armstrong J, Raney BJ, Streeter I, Dunn M, Yang F, Odom D, Flicek P, Keane TM, Thybert D, Paten B, Pham S. Chromosome assembly of large and complex genomes using multiple references. Genome Res 2018; 28:1720-1732. [PMID: 30341161 PMCID: PMC6211643 DOI: 10.1101/gr.236273.118] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Received: 03/06/2018] [Accepted: 09/24/2018] [Indexed: 11/25/2022]
Abstract
Despite the rapid development of sequencing technologies, the assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout 2, a reference-assisted assembly tool that works for large and complex genomes. By taking one or more target assemblies (generated from an NGS assembler) and one or multiple related reference genomes, Ragout 2 infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. By using Ragout 2, we transformed NGS assemblies of 16 laboratory mouse strains into sets of complete chromosomes, leaving <5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realigning of long Pacific Biosciences (PacBio) reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. We applied Ragout 2 to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared with other genomes from the Muridae family. Chromosome painting maps confirmed most large-scale rearrangements that Ragout 2 detected. We applied Ragout 2 to improve draft sequences of three ape genomes that have recently been published. Ragout 2 transformed three sets of contigs (generated using PacBio reads only) into chromosome-scale assemblies with accuracy comparable to chromosome assemblies generated in the original study using BioNano maps, Hi-C, BAC clones, and FISH.
Affiliation(s)
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California, San Diego, California 92093, USA
| | - Joel Armstrong
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
| | - Brian J Raney
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
| | - Ian Streeter
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
| | - Matthew Dunn
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
| | - Fengtang Yang
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
| | - Duncan Odom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- Cancer Research UK Cambridge Institute, University of Cambridge, CB2 0RE Cambridge, United Kingdom
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
| | - Thomas M Keane
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- School of Life Sciences, University of Nottingham, Nottingham NG7 2NR, United Kingdom
| | - David Thybert
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Earlham Institute, Norwich Research Park, Norwich NR4 7UG, United Kingdom
| | - Benedict Paten
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
| | - Son Pham
- BioTuring Incorporated, San Diego, California 92121, USA
|
24
|
Lee T, Kim MY, Ha J, Lee SH. Detection of large sequence insertions by a hybrid approach that combine de novo assembly and resequencing of medium-coverage genome sequences. Genome 2018; 61:745-754. [PMID: 30227080 DOI: 10.1139/gen-2018-0027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/17/2023]
Abstract
Large sequence insertion (LSI) is one of the structural variations (SVs) that may cause phenotypic differences in plants. To identify the LSIs using medium-coverage sequencing data of four wild soybean (Glycine soja) genotypes, we designed a hybrid approach combining de novo assembly and read mapping. Total reads and reads with both ends unmapped were independently assembled into "ordinary contigs" and "orphan contigs", respectively, and subjected to pairwise alignment and stringent filtering. This approach predicted 24 LSIs averaging 2682 bp in size, with no overlap with SVs detected by Pindel, BreakDancer, or ScanIndel, and they were validated by PCR. Compared with the soybean (Glycine max) reference genome, 20 LSIs were located outside genic regions. One of the four LSIs within a genic region, LSI05, is located in the coding DNA sequence region of a protein kinase superfamily gene (Glyma.08G123500). It caused delayed translation initiation and loss of 24 amino acids in the wild soybean genotype CW12. LSI05 was more frequently observed in 29 G. soja accessions than in 34 G. max accessions. Identified LSIs would be genomic resources harboring novel gene contents for studying SVs and improving crops. Moreover, our cost-efficient approach may be applicable to other plant species.
Affiliation(s)
- Taeyoung Lee
- a Department of Plant Science and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Republic of Korea
| | - Moon Young Kim
- a Department of Plant Science and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Republic of Korea
- b Plant Genomics and Breeding Institute, Seoul National University, Seoul 08826, Republic of Korea
| | - Jungmin Ha
- a Department of Plant Science and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Republic of Korea
- b Plant Genomics and Breeding Institute, Seoul National University, Seoul 08826, Republic of Korea
| | - Suk-Ha Lee
- a Department of Plant Science and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Republic of Korea
- b Plant Genomics and Breeding Institute, Seoul National University, Seoul 08826, Republic of Korea
|
25
|
|
26
|
Abstract
Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, we recently found that results can differ strikingly depending on which assembler is used. In this study, we formally test the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.
Affiliation(s)
| | - Jing Wang
- Institute of Environmental Science and Research at the National Centre for Biosecurity and Infectious Disease, Upper Hutt, New Zealand
| | - Richard J. Hall
- Animal Health Laboratory, Investigation and Diagnostic Centres and Response, Ministry for Primary Industries—Manatū Ahu Matua, Upper Hutt, New Zealand
|
27
|
Besnard F, Koutsovoulos G, Dieudonné S, Blaxter M, Félix MA. Toward Universal Forward Genetics: Using a Draft Genome Sequence of the Nematode Oscheius tipulae To Identify Mutations Affecting Vulva Development. Genetics 2017; 206:1747-1761. [PMID: 28630114 PMCID: PMC5560785 DOI: 10.1534/genetics.117.203521] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 05/03/2017] [Accepted: 06/15/2017] [Indexed: 12/30/2022]
Abstract
Mapping-by-sequencing has become a standard method to map and identify phenotype-causing mutations in model species. Here, we show that a fragmented draft assembly is sufficient to perform mapping-by-sequencing in nonmodel species. We generated a draft assembly and annotation of the genome of the free-living nematode Oscheius tipulae, a distant relative of the model Caenorhabditis elegans. We used this draft to identify the likely causative mutations at the O. tipulae cov-3 locus, which affect vulval development. The cov-3 locus encodes the O. tipulae ortholog of C. elegans mig-13, and we further show that Cel-mig-13 mutants also have an unsuspected vulval-development phenotype. In a virtuous circle, we were able to use the linkage information collected during mutant mapping to improve the genome assembly. These results showcase the promise of genome-enabled forward genetics in nonmodel species.
Affiliation(s)
- Fabrice Besnard
- École Normale Supérieure, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institut de Biologie de l'École Normale Supérieure, Paris Sciences et Lettres Research University, 75005, France
| | | | - Sana Dieudonné
- École Normale Supérieure, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institut de Biologie de l'École Normale Supérieure, Paris Sciences et Lettres Research University, 75005, France
| | - Mark Blaxter
- Institute of Evolutionary Biology, University of Edinburgh, EH8 9YL, United Kingdom
| | - Marie-Anne Félix
- École Normale Supérieure, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institut de Biologie de l'École Normale Supérieure, Paris Sciences et Lettres Research University, 75005, France
|
28
|
Zheng-Bradley X, Streeter I, Fairley S, Richardson D, Clarke L, Flicek P. Alignment of 1000 Genomes Project reads to reference assembly GRCh38. Gigascience 2017; 6:1-8. [PMID: 28531267 PMCID: PMC5522380 DOI: 10.1093/gigascience/gix038] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 02/07/2017] [Revised: 03/29/2017] [Accepted: 05/19/2017] [Indexed: 12/30/2022]
Abstract
The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.
Collapse
Affiliation(s)
- Xiangqun Zheng-Bradley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ian Streeter
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- David Richardson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
29
|
Abstract
A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE on two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
Affiliation(s)
- Eric Rivals
- LIRMM, CNRS and Université de Montpellier, 34095 Montpellier, France
- Institut Biologie Computationnelle, CNRS and Université de Montpellier, 34095 Montpellier, France
|
30
|
Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017. [PMID: 28130360 DOI: 10.1101/gr.213405.116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/06/2023]
Abstract
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
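For readers unfamiliar with the contiguity statistic quoted in this abstract, the following is a minimal sketch of how N50 is conventionally computed from a list of contig lengths. The function name `n50` and the optional `genome_size` argument (which switches the calculation to NG50) are illustrative choices, not part of MaSuRCA or any tool cited here.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of the contig lengths, or the NG50 if genome_size is given.

    N50 is the length of the contig at which the running total of contigs,
    taken from longest to shortest, first reaches half of the assembly size.
    NG50 uses half of an external genome-size estimate instead.
    Returns None when the contigs never reach the halfway point (NG50 only).
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    target = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= target:  # integer comparison avoids float rounding
            return length
    return None


if __name__ == "__main__":
    contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # toy lengths, total 32
    print(n50(contigs))                  # N50 of the toy assembly
    print(n50(contigs, genome_size=40))  # NG50 against an assumed 40-unit genome
```

Because the statistic depends only on the sorted lengths, it says nothing about correctness, which is why the abstract pairs it with optical-map and BAC-based accuracy checks.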
Affiliation(s)
- Aleksey V Zimin
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Daniela Puiu
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Ming-Cheng Luo
- Department of Plant Sciences, University of California, Davis, California 95616, USA
- Tingting Zhu
- Department of Plant Sciences, University of California, Davis, California 95616, USA
- Sergey Koren
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Guillaume Marçais
- Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- James A Yorke
- Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA
- Jan Dvořák
- Department of Plant Sciences, University of California, Davis, California 95616, USA
- Steven L Salzberg
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA
|
31
|
Clavijo BJ, Venturini L, Schudoma C, Accinelli GG, Kaithakottil G, Wright J, Borrill P, Kettleborough G, Heavens D, Chapman H, Lipscombe J, Barker T, Lu FH, McKenzie N, Raats D, Ramirez-Gonzalez RH, Coince A, Peel N, Percival-Alwyn L, Duncan O, Trösch J, Yu G, Bolser DM, Namaati G, Kerhornou A, Spannagl M, Gundlach H, Haberer G, Davey RP, Fosker C, Palma FD, Phillips AL, Millar AH, Kersey PJ, Uauy C, Krasileva KV, Swarbreck D, Bevan MW, Clark MD. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res 2017; 27:885-896. [PMID: 28420692 PMCID: PMC5411782 DOI: 10.1101/gr.217117.116] [Citation(s) in RCA: 243] [Impact Index Per Article: 34.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2016] [Accepted: 03/14/2017] [Indexed: 01/16/2023]
Abstract
Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known and identified one novel genome rearrangements. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.
Affiliation(s)
- Tom Barker
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- Fu-Hao Lu
- John Innes Centre, Norwich, NR4 7UH, United Kingdom
- Dina Raats
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- Ned Peel
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- Owen Duncan
- ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley Western Australia 6009, Australia
- Josua Trösch
- ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley Western Australia 6009, Australia
- Guotai Yu
- John Innes Centre, Norwich, NR4 7UH, United Kingdom
- Dan M Bolser
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Guy Namaati
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Arnaud Kerhornou
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Manuel Spannagl
- Plant Genome and Systems Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
- Heidrun Gundlach
- Plant Genome and Systems Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
- Georg Haberer
- Plant Genome and Systems Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
- Robert P Davey
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- Federica Di Palma
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- A Harvey Millar
- ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley Western Australia 6009, Australia
- Paul J Kersey
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Ksenia V Krasileva
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- The Sainsbury Laboratory, Norwich, NR4 7UH, United Kingdom
- David Swarbreck
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- Matthew D Clark
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
|
32
|
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res 2017; 27:768-777. [PMID: 28232478 PMCID: PMC5411771 DOI: 10.1101/gr.214346.116] [Citation(s) in RCA: 358] [Impact Index Per Article: 51.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2016] [Accepted: 02/14/2017] [Indexed: 01/19/2023]
Abstract
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
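The idea named in this abstract, storing the k-mers of a de Bruijn graph in a Bloom filter and recovering edges by membership queries, can be illustrated with a toy sketch. This is not the ABySS 2.0 implementation: the class `BloomFilter`, the helpers `kmers` and `successors`, the filter size, and the use of salted BLAKE2b hashes are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions; no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bloom, kmer):
    """Query the four possible next k-mers; edges are implicit, not stored."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


if __name__ == "__main__":
    bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
    for km in kmers("ACGTTGCA", 4):
        bloom.add(km)
    # May also list false-positive neighbours, which a real assembler must prune.
    print(successors(bloom, "ACGT"))
```

The memory saving comes from storing only bits rather than the k-mers themselves; the price is a small false-positive rate, so a successor query can report a neighbour that was never inserted.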
Affiliation(s)
- Shaun D Jackman
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Benjamin P Vandervalk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Hamid Mohamadi
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Justin Chu
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Sarah Yeo
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Golnaz Jahesh
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Hamza Khan
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Lauren Coombe
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Rene L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
|
33
|
Weissensteiner MH, Pang AWC, Bunikis I, Höijer I, Vinnere-Petterson O, Suh A, Wolf JBW. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res 2017; 27:697-708. [PMID: 28360231 PMCID: PMC5411765 DOI: 10.1101/gr.215095.116] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2016] [Accepted: 03/10/2017] [Indexed: 12/27/2022]
Abstract
Accurate and contiguous genome assembly is key to a comprehensive understanding of the processes shaping genomic diversity and evolution. Yet, it is frequently constrained by constitutive heterochromatin, usually characterized by highly repetitive DNA. As a key feature of genome architecture associated with centromeric and subtelomeric regions, it locally influences meiotic recombination. In this study, we assess the impact of large tandem repeat arrays on the recombination rate landscape in an avian speciation model, the Eurasian crow. We assembled two high-quality genome references using single-molecule real-time sequencing (long-read assembly [LR]) and single-molecule optical maps (optical map assembly [OM]). A three-way comparison including the published short-read assembly (SR) constructed for the same individual allowed assessing assembly properties and pinpointing misassemblies. By combining information from all three assemblies, we characterized 36 previously unidentified large repetitive regions in the proximity of sequence assembly breakpoints, the majority of which contained complex arrays of a 14-kb satellite repeat or its 1.2-kb subunit. Using whole-genome population resequencing data, we estimated the population-scaled recombination rate (ρ) and found it to be significantly reduced in these regions. These findings are consistent with an effect of low recombination in regions adjacent to centromeric or subtelomeric heterochromatin and add to our understanding of the processes generating widespread heterogeneity in genetic diversity and differentiation along the genome. By combining three different technologies, our results highlight the importance of adding a layer of information on genome structure that is inaccessible to each approach independently.
Affiliation(s)
- Matthias H Weissensteiner
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, SE-752 36 Uppsala, Sweden
- Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilian University of Munich, 82152 Planegg-Martinsried, Germany
- Ignas Bunikis
- SciLife Lab Uppsala, Uppsala University SE-751 85 Uppsala, Sweden
- Ida Höijer
- SciLife Lab Uppsala, Uppsala University SE-751 85 Uppsala, Sweden
- Alexander Suh
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, SE-752 36 Uppsala, Sweden
- Jochen B W Wolf
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, SE-752 36 Uppsala, Sweden
- Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilian University of Munich, 82152 Planegg-Martinsried, Germany
|
34
|
Dudchenko O, Batra SS, Omer AD, Nyquist SK, Hoeger M, Durand NC, Shamim MS, Machol I, Lander ES, Aiden AP, Aiden EL. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 2017; 356:92-95. [PMID: 28336562 PMCID: PMC5635820 DOI: 10.1126/science.aal3327] [Citation(s) in RCA: 1131] [Impact Index Per Article: 161.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2016] [Accepted: 03/13/2017] [Indexed: 01/04/2023]
Abstract
The Zika outbreak, spread by the Aedes aegypti mosquito, highlights the need to create high-quality assemblies of large genomes in a rapid and cost-effective way. Here we combine Hi-C data with existing draft assemblies to generate chromosome-length scaffolds. We validate this method by assembling a human genome, de novo, from short reads alone (67× coverage). We then combine our method with draft sequences to create genome assemblies of the mosquito disease vectors Ae. aegypti and Culex quinquefasciatus, each consisting of three scaffolds corresponding to the three chromosomes in each species. These assemblies indicate that almost all genomic rearrangements among these species occur within, rather than between, chromosome arms. The genome assembly procedure we describe is fast, inexpensive, and accurate, and can be applied to many species.
Affiliation(s)
- Olga Dudchenko
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Center for Theoretical and Biological Physics, Rice University, Houston, TX 77030, USA
- Sanjit S Batra
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Arina D Omer
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Sarah K Nyquist
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Marie Hoeger
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Neva C Durand
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Muhammad S Shamim
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Ido Machol
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Eric S Lander
- Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA
- Department of Biology, MIT, Cambridge, MA 02139, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Aviva Presser Aiden
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Bioengineering, Rice University, Houston, TX 77030, USA
- Department of Pediatrics, Texas Children's Hospital, Houston, TX 77030, USA
- Erez Lieberman Aiden
- The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Departments of Computer Science and Computational and Applied Mathematics, Rice University, Houston, TX 77030, USA
- Center for Theoretical and Biological Physics, Rice University, Houston, TX 77030, USA
- Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA
|
35
|
Guan R, Zhao Y, Zhang H, Fan G, Liu X, Zhou W, Shi C, Wang J, Liu W, Liang X, Fu Y, Ma K, Zhao L, Zhang F, Lu Z, Lee SMY, Xu X, Wang J, Yang H, Fu C, Ge S, Chen W. Draft genome of the living fossil Ginkgo biloba. Gigascience 2016. [PMID: 27871309 DOI: 10.1186/s13742-016-0154-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/08/2023] Open
Abstract
BACKGROUND Ginkgo biloba L. (Ginkgoaceae) is one of the most distinctive plants. It possesses a suite of fascinating characteristics including a large genome, outstanding resistance/tolerance to abiotic and biotic stresses, and dioecious reproduction, making it an ideal model species for biological studies. However, the lack of a high-quality genome sequence has been an impediment to our understanding of its biology and evolution. FINDINGS The 10.61 Gb genome sequence containing 41,840 annotated genes was assembled in the present study. Repetitive sequences account for 76.58% of the assembled sequence, and long terminal repeat retrotransposons (LTR-RTs) are particularly prevalent. The diversity and abundance of LTR-RTs is due to their gradual accumulation and a remarkable amplification between 16 and 24 million years ago, and they contribute to the long introns and large genome. Whole genome duplication (WGD) may have occurred twice, with an ancient WGD consistent with that shown to occur in other seed plants, and a more recent event specific to ginkgo. Abundant gene clusters from tandem duplication were also evident, and enrichment of expanded gene families indicates a remarkable array of chemical and antibacterial defense pathways. CONCLUSIONS The ginkgo genome consists mainly of LTR-RTs resulting from ancient gradual accumulation and two WGD events. The multiple defense mechanisms underlying the characteristic resilience of ginkgo are fostered by a remarkable enrichment in ancient duplicated and ginkgo-specific gene clusters. The present study sheds light on sequencing large genomes, and opens an avenue for further genetic and evolutionary research.
Affiliation(s)
- Rui Guan
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Yunpeng Zhao
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- He Zhang
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
- Stanley Ho Centre for Emerging Infectious Diseases, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, Hong Kong
- Guangyi Fan
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
- State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, Macao, China
- Xin Liu
- BGI-Shenzhen, Shenzhen, 518083, China
- Wenbin Zhou
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- Weiqing Liu
- BGI-Wuhan, BGI-Shenzhen, Wuhan, 430074, China
- Yuanyuan Fu
- BGI-Shenzhen, Shenzhen, 518083, China
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Lijun Zhao
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- Fumin Zhang
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China
- Zuhong Lu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Simon Ming-Yuen Lee
- State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, Macao, China
- Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China
- Jian Wang
- BGI-Shenzhen, Shenzhen, 518083, China
- James D. Watson Institute of Genome Sciences, Hangzhou, 310058, China
- Huanming Yang
- BGI-Shenzhen, Shenzhen, 518083, China
- James D. Watson Institute of Genome Sciences, Hangzhou, 310058, China
- Chengxin Fu
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- Song Ge
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China
- Wenbin Chen
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
|
36
|
Guan R, Zhao Y, Zhang H, Fan G, Liu X, Zhou W, Shi C, Wang J, Liu W, Liang X, Fu Y, Ma K, Zhao L, Zhang F, Lu Z, Lee SMY, Xu X, Wang J, Yang H, Fu C, Ge S, Chen W. Draft genome of the living fossil Ginkgo biloba. Gigascience 2016; 5:49. [PMID: 27871309 PMCID: PMC5118899 DOI: 10.1186/s13742-016-0154-1] [Citation(s) in RCA: 153] [Impact Index Per Article: 19.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2016] [Accepted: 11/01/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ginkgo biloba L. (Ginkgoaceae) is one of the most distinctive plants. It possesses a suite of fascinating characteristics including a large genome, outstanding resistance/tolerance to abiotic and biotic stresses, and dioecious reproduction, making it an ideal model species for biological studies. However, the lack of a high-quality genome sequence has been an impediment to our understanding of its biology and evolution. FINDINGS The 10.61 Gb genome sequence containing 41,840 annotated genes was assembled in the present study. Repetitive sequences account for 76.58% of the assembled sequence, and long terminal repeat retrotransposons (LTR-RTs) are particularly prevalent. The diversity and abundance of LTR-RTs is due to their gradual accumulation and a remarkable amplification between 16 and 24 million years ago, and they contribute to the long introns and large genome. Whole genome duplication (WGD) may have occurred twice, with an ancient WGD consistent with that shown to occur in other seed plants, and a more recent event specific to ginkgo. Abundant gene clusters from tandem duplication were also evident, and enrichment of expanded gene families indicates a remarkable array of chemical and antibacterial defense pathways. CONCLUSIONS The ginkgo genome consists mainly of LTR-RTs resulting from ancient gradual accumulation and two WGD events. The multiple defense mechanisms underlying the characteristic resilience of ginkgo are fostered by a remarkable enrichment in ancient duplicated and ginkgo-specific gene clusters. The present study sheds light on sequencing large genomes, and opens an avenue for further genetic and evolutionary research.
Affiliation(s)
- Rui Guan
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Yunpeng Zhao
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- He Zhang
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
- Stanley Ho Centre for Emerging Infectious Diseases, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, Hong Kong
- Guangyi Fan
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
- State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, Macao, China
- Xin Liu
- BGI-Shenzhen, Shenzhen, 518083, China
- Wenbin Zhou
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- Weiqing Liu
- BGI-Wuhan, BGI-Shenzhen, Wuhan, 430074, China
- Yuanyuan Fu
- BGI-Shenzhen, Shenzhen, 518083, China
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Lijun Zhao
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- Fumin Zhang
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China
- Zuhong Lu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Simon Ming-Yuen Lee
- State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, Macao, China
- Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China
- Jian Wang
- BGI-Shenzhen, Shenzhen, 518083, China
- James D. Watson Institute of Genome Sciences, Hangzhou, 310058, China
- Huanming Yang
- BGI-Shenzhen, Shenzhen, 518083, China
- James D. Watson Institute of Genome Sciences, Hangzhou, 310058, China
- Chengxin Fu
- The Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Laboratory of Systematic & Evolutionary Botany and Biodiversity, Institute of Ecology and Conservation Center for Gene Resources of Endangered Wildlife, Zhejiang University, Hangzhou, 310058, China
- Song Ge
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China
- Wenbin Chen
- BGI-Shenzhen, Shenzhen, 518083, China
- BGI-Qingdao, Qingdao, 266555, China
37
Chawla V, Kumar R, Shankar R. Identifying wrong assemblies in de novo short read primary sequence assembly contigs. J Biosci 2016; 41:455-74. [PMID: 27581937 DOI: 10.1007/s12038-016-9630-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these assemblies are produced with de novo short-read assemblers and related approaches. However, the contigs produced this way are prone to mis-assembly, and so far there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies can result from incorrectly deleted or wrongly arranged genomic sequences. In the present work, various factors related to sequence, sequencing and assembly have been assessed for their role in causing mis-assembly, using different genome sequencing data. Some mis-assembly detection tools have also been evaluated for their ability to detect wrongly assembled primary contigs, and the results suggest considerable scope for improvement in this area. The present work also proposes a simple, novel approach based on unsupervised learning to identify mis-assemblies in contigs, which was found to perform reasonably well when compared with existing tools that report mis-assembled contigs. It was observed that the proposed methodology may work as a complementary system to the existing tools to enhance their accuracy.
Affiliation(s)
- Vandna Chawla
- Studio of Computational Biology and Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
38
Tamazian G, Dobrynin P, Krasheninnikova K, Komissarov A, Koepfli KP, O’Brien SJ. Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences. Gigascience 2016; 5:38. [PMID: 27549770 PMCID: PMC4994284 DOI: 10.1186/s13742-016-0141-6] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 12/08/2015] [Accepted: 07/31/2016] [Indexed: 11/17/2022]
Abstract
BACKGROUND As the number of sequenced genomes rapidly increases, chromosome assembly is becoming an even more crucial step of any genome study. Since de novo chromosome assemblies are confounded by repeat-mediated artifacts, reference-assisted assemblies that use comparative inference have become widely used, prompting the development of several reference-assisted assembly programs for prokaryotic and eukaryotic genomes. FINDINGS We developed Chromosomer - a reference-based genome arrangement tool, which rapidly builds chromosomes from genome contigs or scaffolds using their alignments to a reference genome of a closely related species. Chromosomer does not require mate-pair libraries and it offers a number of auxiliary tools that implement common operations accompanying the genome assembly process. CONCLUSIONS Despite implementing a straightforward alignment-based approach, Chromosomer is a useful tool for genomic analysis of species without chromosome maps. Putative chromosome assemblies by Chromosomer can be used in comparative genomic analysis, genomic variation assessment, potential linkage group inference and other kinds of analysis involving contig or scaffold mapping to a high-quality assembly.
Affiliation(s)
- Gaik Tamazian
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Sredniy Prospekt 41A, St. Petersburg, 199004 Russia
- Pavel Dobrynin
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Sredniy Prospekt 41A, St. Petersburg, 199004 Russia
- Ksenia Krasheninnikova
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Sredniy Prospekt 41A, St. Petersburg, 199004 Russia
- Aleksey Komissarov
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Sredniy Prospekt 41A, St. Petersburg, 199004 Russia
- Klaus-Peter Koepfli
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Sredniy Prospekt 41A, St. Petersburg, 199004 Russia
- National Zoology Park, Smithsonian Conservation Biology Institute, 3001 Connecticut Avenue NW, Washington, 20008 D.C. USA
- Stephen J. O’Brien
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Sredniy Prospekt 41A, St. Petersburg, 199004 Russia
- Oceanographic Center, Nova Southeastern University, 8000 N. Ocean Drive, Ft. Lauderdale, 33004 Florida USA
39
Cápal P, Blavet N, Vrána J, Kubaláková M, Doležel J. Multiple displacement amplification of the DNA from single flow-sorted plant chromosome. Plant J 2015; 84:838-844. [PMID: 26400218 DOI: 10.1111/tpj.13035] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 08/11/2015] [Revised: 09/13/2015] [Accepted: 09/17/2015] [Indexed: 06/05/2023]
Abstract
A protocol is described for production of micrograms of DNA from single copies of flow-sorted plant chromosomes. Of 183 single copies of wheat chromosome 3B, 118 (64%) were successfully amplified. Sequencing DNA amplification products using an Illumina HiSeq 2000 system to 10× coverage and merging sequences from three separate amplifications resulted in 60% coverage of the chromosome 3B reference, entirely covering 30% of its genes. The merged sequences permitted de novo assembly of 19% of chromosome 3B genes, with 10% of genes contained in a single contig, and 39% of genes covered for at least 80% of their length. The chromosome-derived sequences allowed identification of missing genic sequences in the chromosome 3B reference and short sequences similar to 3B in survey sequences of other wheat chromosomes. These observations indicate that single-chromosome sequencing is suitable to identify genic sequences on particular chromosomes, to develop chromosome-specific DNA markers, to verify assignment of DNA sequence contigs to individual pseudomolecules, and to validate whole-genome assemblies. The protocol expands the potential of chromosome genomics, which may now be applied to any plant species from which chromosome samples suitable for flow cytometry can be prepared, and opens new avenues for studies on chromosome structural heterozygosity and haplotype phasing in plants.
Affiliation(s)
- Petr Cápal
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Šlechtitelů 31, CZ-78371, Olomouc, Czech Republic
- Nicolas Blavet
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Šlechtitelů 31, CZ-78371, Olomouc, Czech Republic
- Palacký University Olomouc, Centre of the Region Haná for Biotechnological and Agricultural Research, Šlechtitelů 27, CZ-78371, Olomouc, Czech Republic
- Jan Vrána
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Šlechtitelů 31, CZ-78371, Olomouc, Czech Republic
- Marie Kubaláková
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Šlechtitelů 31, CZ-78371, Olomouc, Czech Republic
- Jaroslav Doležel
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Šlechtitelů 31, CZ-78371, Olomouc, Czech Republic
40
Abstract
Obtaining bacterial genomic sequences has become a routine task in today's biology. The emergence of the comparative genomics approach has led to an increasing number of bacterial species having more than one strain sequenced, which facilitates the annotation process. On the other hand, many genomic sequences are now left in "draft" status, as a series of contigs, mainly because the finishing task is labor-intensive. As a result, many genomic analyses are incomplete (e.g., in their annotation) or impossible to perform (e.g., structural genomics analysis). Many approaches have recently been developed to facilitate the finishing process, or at least to produce higher-quality scaffolds. Taking advantage of the comparative genomics paradigm, these approaches align the contigs to closely related genomes and determine their relative orientation. In this chapter we present the use of the CONTIGuator algorithm, which aligns the contigs from a draft genome to a closely related closed genome and resolves their relative orientation based on this alignment, producing a scaffold and a series of PCR primer pairs for the finishing process. The CONTIGuator algorithm is also capable of handling multipartite genomes (i.e., genomes having chromosomes and other plasmids), unambiguously mapping contigs to the most similar replicon. The program also produces a series of contig maps that allow structural genomics analysis to be performed on the draft genome. The functionalities of the web interface, as well as the command line version, are presented.
Affiliation(s)
- Marco Galardini
- Department of Biology, University of Florence, Florence, Italy
41
Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, Caccamo M, Bernal W, Heeney JL. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS One 2015; 10:e0129059. [PMID: 26098299 PMCID: PMC4476701 DOI: 10.1371/journal.pone.0129059] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 09/10/2014] [Accepted: 05/04/2015] [Indexed: 12/18/2022]
Abstract
The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs (post-assembly), making subsequent viral analysis difficult. Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.
Affiliation(s)
- Gordon M. Daly
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Richard M. Leggett
- The Genome Analysis Centre (TGAC), Norwich Research Park, Norwich, NR47UH, United Kingdom
- William Rowe
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Samuel Stubbs
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Maxim Wilkinson
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Mario Caccamo
- The Genome Analysis Centre (TGAC), Norwich Research Park, Norwich, NR47UH, United Kingdom
- William Bernal
- Institute of Liver Studies, King's College Hospital, Denmark Hill, London, SE59RS, United Kingdom
- Jonathan L. Heeney
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
42
Akpinar BA, Magni F, Yuce M, Lucas SJ, Šimková H, Šafář J, Vautrin S, Bergès H, Cattonaro F, Doležel J, Budak H. The physical map of wheat chromosome 5DS revealed gene duplications and small rearrangements. BMC Genomics 2015; 16:453. [PMID: 26070810 PMCID: PMC4465308 DOI: 10.1186/s12864-015-1641-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 10/20/2014] [Accepted: 05/19/2015] [Indexed: 11/24/2022]
Abstract
BACKGROUND The very large bread wheat genome, organized into three highly similar sub-genomes, renders genomic research challenging. The construction of BAC-based physical maps of individual chromosomes reduces the complexity of this allohexaploid genome, enables elucidation of gene space and evolutionary relationships, provides tools for map-based cloning, and serves as a framework for reference sequencing efforts. In this study, we constructed the first comprehensive physical map of wheat chromosome arm 5DS, thereby exploring its gene space organization and evolution. RESULTS The physical map of 5DS comprised 164 contigs, of which 45 were organized into 21 supercontigs, covering 176 Mb with an N50 value of 2,173 kb. Fifty-eight of the contigs were larger than 1 Mb, with the largest contig spanning 6,649 kb. A total of 1,864 molecular markers were assigned to the map at a density of 10.5 markers/Mb, anchoring 100 of the 120 contigs (>5 clones) that constitute ~95% of the cumulative length of the map. Ordering of 80 contigs along the deletion bins of chromosome arm 5DS revealed small-scale breaks in syntenic blocks. Analysis of the gene space of 5DS suggested an increasing gradient of genes organized in islands towards the telomere, with the highest gene density of 5.17 genes/Mb in the 0.67-0.78 deletion bin, 1.4 to 1.6 times that of all other bins. CONCLUSIONS Here, we provide a chromosome-specific view into the organization and evolution of the D genome of bread wheat, in comparison to one of its ancestors, revealing recent genome rearrangements. The high-quality physical map constructed in this study paves the way for the assembly of a reference sequence, from which breeding efforts will greatly benefit.
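The N50 value quoted in this abstract is a length-weighted summary of contig sizes: the length of the contig at which the cumulative length, counted from the largest contig downwards, first reaches half of the total. A minimal Python sketch of that definition follows. The function name `n50` and the example lengths are illustrative assumptions and are not data from the 5DS physical map.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the first contig at which the running total reaches at least half
    of the summed length of all contigs.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # Only reached when no lengths were supplied.
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (arbitrary units), for illustration only.
example_lengths = [8, 8, 4, 3, 3, 2, 2, 2]
print(n50(example_lengths))  # 8
```

Because the statistic depends only on the sorted lengths, it says nothing about whether contigs are correctly ordered or oriented, so it is usually read alongside anchoring figures such as the marker counts reported in the abstract.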
Affiliation(s)
- Bala Ani Akpinar
- Sabanci University Nanotechnology Research and Application Centre (SUNUM), Sabanci University, Universite Cad. Orta Mah. No: 27, Tuzla, 34956, Istanbul, Turkey.
- Federica Magni
- Istituto di Genomica Applicata, Via J.Linussio 51, Udine, 33100, Italy.
- Meral Yuce
- Sabanci University Nanotechnology Research and Application Centre (SUNUM), Sabanci University, Universite Cad. Orta Mah. No: 27, Tuzla, 34956, Istanbul, Turkey.
- Stuart J Lucas
- Sabanci University Nanotechnology Research and Application Centre (SUNUM), Sabanci University, Universite Cad. Orta Mah. No: 27, Tuzla, 34956, Istanbul, Turkey.
- Hana Šimková
- Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, CZ-78371, Olomouc, Czech Republic.
- Jan Šafář
- Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, CZ-78371, Olomouc, Czech Republic.
- Sonia Vautrin
- Centre Nationales Ressources Génomiques Végétales, INRA UPR 1258, 24 Chemin de Borde Rouge - Auzeville 31326, Castanet-Tolosan, France.
- Hélène Bergès
- Centre Nationales Ressources Génomiques Végétales, INRA UPR 1258, 24 Chemin de Borde Rouge - Auzeville 31326, Castanet-Tolosan, France.
- Federica Cattonaro
- Istituto di Genomica Applicata, Via J.Linussio 51, Udine, 33100, Italy.
- Jaroslav Doležel
- Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, CZ-78371, Olomouc, Czech Republic.
- Hikmet Budak
- Sabanci University Nanotechnology Research and Application Centre (SUNUM), Sabanci University, Universite Cad. Orta Mah. No: 27, Tuzla, 34956, Istanbul, Turkey.
- Molecular Biology, Genetics and Bioengineering Program, Sabanci University, 34956, Istanbul, Turkey.
43
Song G, Dickins BJA, Demeter J, Engel S, Dunn B, Cherry JM. AGAPE (Automated Genome Analysis PipelinE) for pan-genome analysis of Saccharomyces cerevisiae. PLoS One 2015; 10:e0120671. [PMID: 25781462 PMCID: PMC4363492 DOI: 10.1371/journal.pone.0120671] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 10/20/2014] [Accepted: 01/25/2015] [Indexed: 11/24/2022]
Abstract
The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.
Affiliation(s)
- Giltae Song
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Benjamin J. A. Dickins
- School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
- Janos Demeter
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Stacia Engel
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Barbara Dunn
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- J. Michael Cherry
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
44
Abstract
Recently developed next-generation sequencing platforms not only decrease the cost of metagenomic data analysis, but also greatly enlarge the size of metagenomic sequence datasets. A common bottleneck of available assemblers is that the trade-off between the noise of the resulting contigs and the gain in sequence length for better annotation has not received enough attention for large-scale sequencing projects, especially for datasets with low coverage and a large number of nonoverlapping contigs. To address this limitation and promote both accuracy and efficiency, we develop a novel metagenomic sequence assembly framework, DIME, based on the DIvide, conquer, and MErge strategies. In addition, we give two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, on the Apache Hadoop platform. For a systematic comparison of assembly performance, we tested DIME and five other popular short-read assembly programs, Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes, on four synthetic and three real metagenomic sequence datasets ranging in size from fifty thousand to a couple of million reads. The experimental results demonstrate that our method not only partitions the sequence reads with extremely high accuracy, but also reconstructs more bases, generates higher-quality assembled consensus, and yields higher assembly scores, including corrected N50 and BLAST-score-per-base, than other tools, with a nearly theoretical speed-up. Results indicate that DIME offers great improvement in assembly across a range of sequence abundances and thus is robust to decreasing coverage.
Affiliation(s)
- Xuan Guo
- Departments of Computer Science and Biology, Georgia State University, Atlanta, Georgia
- Ning Yu
- Departments of Computer Science and Biology, Georgia State University, Atlanta, Georgia
- Xiaojun Ding
- School of Information Science and Engineering, Central South University, Changsha, Hunan, China
- Jianxin Wang
- School of Information Science and Engineering, Central South University, Changsha, Hunan, China
- Yi Pan
- Departments of Computer Science and Biology, Georgia State University, Atlanta, Georgia
45
Abstract
Based on reversible dye-terminator technology, the Illumina/Solexa sequencing platform enables rapid sequencing-by-synthesis (SBS) of large DNA stretches spanning entire genomes, with the latest instruments capable of producing hundreds of gigabases of data in a single sequencing run. Illumina's NGS instruments combine the flexibility of single reads with short- and long-insert paired-end reads, and enable a wide range of DNA sequencing applications. Here, we describe paired-end library preparation with average insert sizes of 470 bp, 2 kbp, and 6 kbp, together with the DNA cluster generation and sequencing procedure for the E. coli O104:H4 genome on the Illumina HiSeq 2000 platform.
Affiliation(s)
- Zhenfei Hu
- BGI, Beishan Road, Shenzhen, 518083, China
46
Abstract
Shotgun sequencing and assembly of a large, complex genome can be expensive, and accurately reconstructing the true genome sequence is challenging. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are the main factors that plague de novo genome sequencing projects, which typically result in highly fragmented assemblies from which it is difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome, and a systematic mechanism for exploring prioritized genomic content through BAC sequencing. If one isolates and sequences only the genome fraction that encodes the relevant biological information, then it is possible to reduce the overall cost and effort of sequencing by targeting a genomic segment. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
47
Orlandini V, Fondi M, Fani R. Methods for assembling reads and producing contigs. Methods Mol Biol 2015; 1231:151-161. [PMID: 25343864 DOI: 10.1007/978-1-4939-1720-4_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Determining the genome sequence of an organism is often the first step towards its molecular characterization. Once a difficult and expensive task, nowadays it is an almost routine practice in many molecular biology labs. In this chapter we discuss in depth the various methods to assemble the short sequences (called reads) obtained from a massive sequencing system, using different software and strategies, and how to perform some fundamental quality controls on the data obtained.
Affiliation(s)
- Valerio Orlandini
- Department of Biology, University of Florence, via Madonna del Piano 6, Sesto Fiorentino, Firenze, 50019, Italy
48
Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 2013; 31:1119-1125. [PMID: 24185095 DOI: 10.1038/nbt.2727] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/25/2013] [Accepted: 10/02/2013] [Indexed: 05/19/2023]
Abstract
Genomes assembled de novo from short reads are highly fragmented relative to the finished chromosomes of Homo sapiens and key model organisms generated by the Human Genome Project. To address this problem, we need scalable, cost-effective methods to obtain assemblies with chromosome-scale contiguity. Here we show that genome-wide chromatin interaction data sets, such as those generated by Hi-C, are a rich source of long-range information for assigning, ordering and orienting genomic sequences to chromosomes, including across centromeres. To exploit this finding, we developed an algorithm that uses Hi-C data for ultra-long-range scaffolding of de novo genome assemblies. We demonstrate the approach by combining shotgun fragment and short jump mate-pair sequences with Hi-C data to generate chromosome-scale de novo assemblies of the human, mouse and Drosophila genomes, achieving--for the human genome--98% accuracy in assigning scaffolds to chromosome groups and 99% accuracy in ordering and orienting scaffolds within chromosome groups. Hi-C data can also be used to validate chromosomal translocations in cancer genomes.
Affiliation(s)
- Joshua N Burton
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA
49
Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 2013; 31:1119-25. [PMID: 24185095 PMCID: PMC4117202 DOI: 10.1038/nbt.2727] [Citation(s) in RCA: 854] [Impact Index Per Article: 77.6] [Received: 06/25/2013] [Accepted: 10/02/2013] [Indexed: 12/21/2022]
Abstract
Genomes assembled de novo from short reads are highly fragmented relative to the finished chromosomes of Homo sapiens and key model organisms generated by the Human Genome Project. To address this problem, we need scalable, cost-effective methods to obtain assemblies with chromosome-scale contiguity. Here we show that genome-wide chromatin interaction data sets, such as those generated by Hi-C, are a rich source of long-range information for assigning, ordering and orienting genomic sequences to chromosomes, including across centromeres. To exploit this finding, we developed an algorithm that uses Hi-C data for ultra-long-range scaffolding of de novo genome assemblies. We demonstrate the approach by combining shotgun fragment and short jump mate-pair sequences with Hi-C data to generate chromosome-scale de novo assemblies of the human, mouse and Drosophila genomes, achieving--for the human genome--98% accuracy in assigning scaffolds to chromosome groups and 99% accuracy in ordering and orienting scaffolds within chromosome groups. Hi-C data can also be used to validate chromosomal translocations in cancer genomes.
Affiliation(s)
- Joshua N. Burton
- Department of Genome Sciences, University of Washington, Seattle, WA 98115, USA
- Andrew Adey
- Department of Genome Sciences, University of Washington, Seattle, WA 98115, USA
- Ruolan Qiu
- Department of Genome Sciences, University of Washington, Seattle, WA 98115, USA
- Jacob O. Kitzman
- Department of Genome Sciences, University of Washington, Seattle, WA 98115, USA
- Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, WA 98115, USA
50
Jiang Y, Ninwichian P, Liu S, Zhang J, Kucuktas H, Sun F, Kaltenboeck L, Sun L, Bao L, Liu Z. Generation of physical map contig-specific sequences useful for whole genome sequence scaffolding. PLoS One 2013; 8:e78872. [PMID: 24205335 PMCID: PMC3811975 DOI: 10.1371/journal.pone.0078872] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/25/2013] [Accepted: 09/16/2013] [Indexed: 11/29/2022] Open
Abstract
Along with the rapid advances of the nextgen sequencing technologies, more and more species are added to the list of organisms whose whole genomes are sequenced. However, the assembled draft genome of many organisms consists of numerous small contigs, due to the short length of the reads generated by nextgen sequencing platforms. In order to improve the assembly and bring the genome contigs together, more genome resources are needed. In this study, we developed a strategy to generate a valuable genome resource, physical map contig-specific sequences, which are randomly distributed genome sequences in each physical contig. Two-dimensional tagging method was used to create specific tags for 1,824 physical contigs, in which the cost was dramatically reduced. A total of 94,111,841 100-bp reads and 315,277 assembled contigs are identified containing physical map contig-specific tags. The physical map contig-specific sequences along with the currently available BAC end sequences were then used to anchor the catfish draft genome contigs. A total of 156,457 genome contigs (~79% of whole genome sequencing assembly) were anchored and grouped into 1,824 pools, in which 16,680 unique genes were annotated. The physical map contig-specific sequences are valuable resources to link physical map, genetic linkage map and draft whole genome sequences, consequently have the capability to improve the whole genome sequences assembly and scaffolding, and improve the genome-wide comparative analysis as well. The strategy developed in this study could also be adopted in other species whose whole genome assembly is still facing a challenge.
Affiliation(s)
- Yanliang Jiang
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Parichart Ninwichian
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Shikai Liu
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Jiaren Zhang
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Huseyin Kucuktas
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Fanyue Sun
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Ludmilla Kaltenboeck
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Luyang Sun
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Lisui Bao
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Zhanjiang Liu
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America