1. Hopkins CE, Brock T, Caulfield TR, Bainbridge M. Phenotypic screening models for rapid diagnosis of genetic variants and discovery of personalized therapeutics. Mol Aspects Med 2022; 91:101153. [PMID: 36411139 PMCID: PMC10073243 DOI: 10.1016/j.mam.2022.101153] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/31/2022] [Revised: 10/22/2022] [Accepted: 10/23/2022] [Indexed: 11/19/2022]
Abstract
Precision medicine strives for highly individualized treatments for disease under the notion that each individual's unique genetic makeup and environmental exposures imprint upon them not only a disposition to illness, but also an optimal therapeutic approach. In the realm of rare disorders, genetic predisposition is often the predominant mechanism driving disease presentation. For such mostly monogenic disorders, a causal gene-to-phenotype association is likely. As a result, it becomes important to query the patient's genome for the presence of pathogenic variations that are likely to cause the disease. Determining whether a variant is pathogenic or not is critical to these analyses and can be challenging, as many disease-causing variants are novel and therefore have no available functional data to help categorize them. This problem is exacerbated by the need for rapid evaluation of pathogenicity, since many genetic diseases present in young children who will experience increased morbidity and mortality without rapid diagnosis and therapeutics. Here, we discuss the utility of animal models, with a focus mainly on C. elegans, as a contrast to tissue culture and in silico approaches, with emphasis on how these systems are used in determining pathogenicity of variants with uncertain significance and then used to screen for novel therapeutics.
Affiliation(s)
- Thomas R Caulfield
  - Mayo Clinic, Department of Neuroscience, Department of Computational Biology, Department of Clinical Genomics, Jacksonville, FL, 32224, Rochester, MN, 55905, USA
2. Muñoz-Montecinos C, Romero A, Sepúlveda V, Vira MÁ, Fehrmann-Cartes K, Marcellini S, Aguilera F, Caprile T, Fuentes R. Turning the Curve Into Straight: Phenogenetics of the Spine Morphology and Coordinate Maintenance in the Zebrafish. Front Cell Dev Biol 2022; 9:801652. [PMID: 35155449 PMCID: PMC8826430 DOI: 10.3389/fcell.2021.801652] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/25/2021] [Accepted: 12/31/2021] [Indexed: 12/13/2022]
Abstract
The vertebral column, or spine, provides mechanical support and determines body axis posture and motion. The most common malformation altering spine morphology and function is adolescent idiopathic scoliosis (AIS), a three-dimensional spinal deformity that affects approximately 4% of the population worldwide. Due to the genetic heterogeneity of AIS and the lack of suitable animal models for its study, the etiology of this condition remains unclear, thus limiting treatment options. We here review current advances in zebrafish phenogenetics concerning AIS-like models and highlight the recently discovered biological processes leading to spine malformations. First, we focus on gene functions and phenotypes controlling critical postembryonic processes that are central to spine architecture development and straightening. Second, we summarize how primary cilia assembly and biomechanical stimulus transduction, cerebrospinal fluid components and flow driven by motile cilia have been implicated in the pathogenesis of AIS-like phenotypes. Third, we highlight the inflammatory responses associated with scoliosis. We finally discuss recent innovations and methodologies for morphometrically characterizing and analyzing the zebrafish spine. Ongoing phenotyping projects are expected to identify novel and unprecedented postembryonic gene functions controlling spine morphology and mutant models of AIS. Importantly, imaging and gene editing technologies are allowing deep phenotyping studies in the zebrafish, opening new experimental paradigms in the morphometric and three-dimensional assessment of spinal malformations. In the future, fully elucidating the phenogenetic underpinnings of AIS etiology in zebrafish and humans will undoubtedly lead to innovative pharmacological treatments against spinal deformities.
Affiliation(s)
- Carlos Muñoz-Montecinos
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Adrián Romero
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Vania Sepúlveda
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- María Ángela Vira
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Karen Fehrmann-Cartes
  - Núcleo de Investigaciones Aplicadas en Ciencias Veterinarias y Agronómicas, Universidad de las Américas, Concepción, Chile
- Sylvain Marcellini
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Felipe Aguilera
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Departamento de Bioquímica y Biología Molecular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Teresa Caprile
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Ricardo Fuentes
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
  - Grupo de Procesos en Biología del Desarrollo (GDeP), Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
3. Fouks B, Brand P, Nguyen HN, Herman J, Camara F, Ence D, Hagen DE, Hoff KJ, Nachweide S, Romoth L, Walden KKO, Guigo R, Stanke M, Narzisi G, Yandell M, Robertson HM, Koeniger N, Chantawannakul P, Schatz MC, Worley KC, Robinson GE, Elsik CG, Rueppell O. The genomic basis of evolutionary differentiation among honey bees. Genome Res 2021; 31:1203-1215. [PMID: 33947700 PMCID: PMC8256857 DOI: 10.1101/gr.272310.120] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/30/2020] [Accepted: 04/22/2021] [Indexed: 02/06/2023]
Abstract
In contrast to the western honey bee, Apis mellifera, other honey bee species have been largely neglected despite their importance and diversity. The genetic basis of the evolutionary diversification of honey bees remains largely unknown. Here, we provide a genome-wide comparison of three honey bee species, each representing one of the three subgenera of honey bees, namely the dwarf (Apis florea), giant (A. dorsata), and cavity-nesting (A. mellifera) honey bees with bumblebees as an outgroup. Our analyses resolve the phylogeny of honey bees with the dwarf honey bees diverging first. We find that evolution of increased eusocial complexity in Apis proceeds via increases in the complexity of gene regulation, which is in agreement with previous studies. However, this process seems to be related to pathways other than transcriptional control. Positive selection patterns across Apis reveal a trade-off between maintaining genome stability and generating genetic diversity, with a rapidly evolving piRNA pathway leading to genomes depleted of transposable elements, and a rapidly evolving DNA repair pathway associated with high recombination rates in all Apis species. Diversification within Apis is accompanied by positive selection in several genes whose putative functions present candidate mechanisms for lineage-specific adaptations, such as migration, immunity, and nesting behavior.
Affiliation(s)
- Bertrand Fouks
  - Department of Biology, University of North Carolina at Greensboro, Greensboro, North Carolina 27403, USA
  - Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, 48149 Münster, Germany
- Philipp Brand
  - Department of Evolution and Ecology, Center for Population Biology, University of California, Davis, Davis, California 95161, USA
  - Laboratory of Neurophysiology and Behavior, The Rockefeller University, New York, New York 10065, USA
- Hung N Nguyen
  - MU Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri 65211, USA
- Jacob Herman
  - Department of Biology, University of North Carolina at Greensboro, Greensboro, North Carolina 27403, USA
- Francisco Camara
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, 08036 Barcelona, Spain
- Daniel Ence
  - School of Forest Resources and Conservation, University of Florida, Gainesville, Florida 32611, USA
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA
- Darren E Hagen
  - Department of Animal and Food Sciences, Oklahoma State University, Stillwater, Oklahoma 74078, USA
- Katharina J Hoff
  - University of Greifswald, Institute for Mathematics and Computer Science, Bioinformatics Group, 17489 Greifswald, Germany
  - University of Greifswald, Center for Functional Genomics of Microbes, 17489 Greifswald, Germany
- Stefanie Nachweide
  - University of Greifswald, Institute for Mathematics and Computer Science, Bioinformatics Group, 17489 Greifswald, Germany
- Lars Romoth
  - University of Greifswald, Institute for Mathematics and Computer Science, Bioinformatics Group, 17489 Greifswald, Germany
- Kimberly K O Walden
  - Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Roderic Guigo
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, 08036 Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain
- Mario Stanke
  - University of Greifswald, Institute for Mathematics and Computer Science, Bioinformatics Group, 17489 Greifswald, Germany
  - University of Greifswald, Center for Functional Genomics of Microbes, 17489 Greifswald, Germany
- Mark Yandell
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA
  - Utah Center for Genetic Discovery, University of Utah, Salt Lake City, Utah 84112, USA
- Hugh M Robertson
  - Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Nikolaus Koeniger
  - Department of Behavioral Physiology and Sociobiology (Zoology II), University of Würzburg, 97074 Würzburg, Germany
- Panuwan Chantawannakul
  - Environmental Science Research Center (ESRC) and Department of Biology, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand
- Michael C Schatz
  - Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Kim C Worley
  - Department of Molecular and Human Genetics, Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Gene E Robinson
  - Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Neuroscience Program, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Christine G Elsik
  - MU Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri 65211, USA
  - Division of Animal Sciences, University of Missouri, Columbia, Missouri 65211, USA
  - Division of Plant Sciences, University of Missouri, Columbia, Missouri 65211, USA
- Olav Rueppell
  - Department of Biology, University of North Carolina at Greensboro, Greensboro, North Carolina 27403, USA
  - Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
4. Fuentes R, Letelier J, Tajer B, Valdivia LE, Mullins MC. Fishing forward and reverse: Advances in zebrafish phenomics. Mech Dev 2018; 154:296-308. [PMID: 30130581 PMCID: PMC6289646 DOI: 10.1016/j.mod.2018.08.007] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 04/04/2018] [Revised: 08/06/2018] [Accepted: 08/17/2018] [Indexed: 12/15/2022]
Abstract
Understanding how the genome instructs the phenotypic characteristics of an organism is one of the major scientific endeavors of our time. Advances in genetics have progressively deciphered the inheritance, identity and biological relevance of genetically encoded information, contributing to the rise of several, complementary omic disciplines. One of them is phenomics, an emergent area of biology dedicated to the systematic multi-scale analysis of phenotypic traits. This discipline provides valuable gene function information to the rapidly evolving field of genetics. Current molecular tools enable genome-wide analyses that link gene sequence to function in multi-cellular organisms, illuminating the genome-phenome relationship. Among vertebrates, zebrafish has emerged as an outstanding model organism for high-throughput phenotyping and modeling of human disorders. Advances in both systematic mutagenesis and phenotypic analyses of embryonic and post-embryonic stages in zebrafish have revealed the function of a valuable collection of genes and the general structure of several complex traits. In this review, we summarize multiple large-scale genetic efforts addressing parental, embryonic, and adult phenotyping in the zebrafish. The genetic and quantitative tools available in the zebrafish model, coupled with the broad spectrum of phenotypes that can be assayed, make it a powerful model for phenomics, well suited for the dissection of genotype-phenotype associations in development, physiology, health and disease.
Affiliation(s)
- Ricardo Fuentes
  - Department of Cell and Developmental Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Joaquín Letelier
  - Centro Andaluz de Biología del Desarrollo (CSIC/UPO/JA), Seville, Spain
  - Center for Integrative Biology, Facultad de Ciencias, Universidad Mayor, Santiago, Chile
- Benjamin Tajer
  - Department of Cell and Developmental Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Leonardo E Valdivia
  - Center for Integrative Biology, Facultad de Ciencias, Universidad Mayor, Santiago, Chile
- Mary C Mullins
  - Department of Cell and Developmental Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
5. Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018; 19:23-40. [PMID: 27742661 DOI: 10.1093/bib/bbw096] [Citation(s) in RCA: 75] [Impact Index Per Article: 12.5] [Received: 06/03/2016] [Indexed: 12/15/2022]
Abstract
Since the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical and computational challenges in de novo assembly remain, although many bright ideas and heuristics have been suggested to tackle them in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graphs (Hamiltonian and Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of the short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, the long read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
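As a rough illustration of the Eulerian formulation mentioned in this abstract, the Python sketch below builds a de Bruijn graph by treating each k-mer in a read as an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `de_bruijn_graph`, the toy reads, and the value of k are assumptions made for illustration only; real assemblers also handle sequencing errors, reverse complements, and graph simplification, none of which is attempted here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, not from the review).

    Nodes are (k-1)-mers; every k-mer found in a read adds one edge
    from its prefix node to its suffix node. Repeated k-mers add
    repeated edges, so edge multiplicity is preserved.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads, chosen only for illustration.
graph = de_bruijn_graph(["ACGTC", "GTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In this formulation an assembly corresponds to a walk that uses the edges of the graph, which is why repeated k-mers (here the edge from `GT` to `TC`, seen in both reads) are a source of the assembly ambiguity the abstract refers to.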
6. Erlich Y, Zielinski D. DNA Fountain enables a robust and efficient storage architecture. Science 2017; 355:950-954. [PMID: 28254941 DOI: 10.1126/science.aaj2038] [Citation(s) in RCA: 291] [Impact Index Per Article: 41.6] [Received: 09/12/2016] [Accepted: 02/09/2017] [Indexed: 12/16/2022]
Abstract
DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10^6 bytes in DNA oligonucleotides and perfectly retrieved the information from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10^15 retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
Affiliation(s)
- Yaniv Erlich
  - New York Genome Center, New York, NY 10013, USA
  - Department of Computer Science, Fu Foundation School of Engineering, Columbia University, New York, NY 10027, USA
  - Center for Computational Biology and Bioinformatics (C2B2), Department of Systems Biology, Columbia University, New York, NY 10027, USA
7. Mamun AA, Pal S, Rajasekaran S. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics 2016; 32:2783-90. [PMID: 27283950 DOI: 10.1093/bioinformatics/btw345] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 01/25/2016] [Accepted: 05/25/2016] [Indexed: 01/30/2023]
Abstract
MOTIVATION: A massive number of bioinformatics applications require counting of k-length substrings in genetically important long strings. A k-mer counter generates the frequencies of each k-length substring in genome sequences. Genome assembly, repeat detection, multiple sequence alignment, error detection and many other related applications use a k-mer counter as a building block. Very fast and efficient algorithms are necessary to count k-mers in large data sets to be useful in such applications. RESULTS: We propose a novel trie-based algorithm for this k-mer counting problem. We compare our devised algorithm, k-mer Counter based on Multiple Burst Trees (KCMBT), with all available well-known algorithms. Our experimental results show that KCMBT is around 30% faster than the previous best-performing algorithm, KMC2, on a human genome dataset. As another example, our algorithm is around six times faster than Jellyfish2. Overall, KCMBT is 20-30% faster than KMC2 on five benchmark data sets when both algorithms were run using multiple threads. AVAILABILITY AND IMPLEMENTATION: KCMBT is freely available on GitHub (https://github.com/abdullah009/kcmbt_mt). CONTACT: rajasek@engr.uconn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
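To make the k-mer counting problem described in this abstract concrete, here is a minimal Python sketch that tallies every k-length substring of a sequence with a hash-based counter. It is a naive baseline written for illustration, not the trie-based KCMBT algorithm; the function name `count_kmers` and the toy sequence are assumptions.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Return the frequency of each k-length substring of `sequence`.

    Naive hash-based baseline (hypothetical helper); it does not
    implement the burst-trie approach that KCMBT proposes.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy input chosen only for illustration.
counts = count_kmers("ACGTACGT", 3)
for kmer, freq in sorted(counts.items()):
    print(kmer, freq)
```

A dictionary-backed counter like this is easy to read, but it keeps every distinct k-mer in memory as a separate string, which is one reason dedicated tools rely on more compact data structures for genome-scale inputs.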
Affiliation(s)
- Abdullah-Al Mamun
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
- Soumitra Pal
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
- Sanguthevar Rajasekaran
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
8. Rehan SM, Glastad KM, Lawson SP, Hunt BG. The Genome and Methylome of a Subsocial Small Carpenter Bee, Ceratina calcarata. Genome Biol Evol 2016; 8:1401-10. [PMID: 27048475 PMCID: PMC4898796 DOI: 10.1093/gbe/evw079] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Accepted: 03/29/2016] [Indexed: 12/14/2022]
Abstract
Understanding the evolution of animal societies, considered to be a major transition in evolution, is a key topic in evolutionary biology. Recently, new gateways for understanding social evolution have opened up due to advances in genomics, allowing for unprecedented opportunities in studying social behavior on a molecular level. In particular, highly eusocial insect species (caste-containing societies with nonreproductives that care for siblings) have taken center stage in studies of the molecular evolution of sociality. Despite advances in genomic studies of both solitary and eusocial insects, we still lack genomic resources for early insect societies. To study the genetic basis of social traits requires comparison of genomes from a diversity of organisms ranging from solitary to complex social forms. Here we present the genome of a subsocial bee, Ceratina calcarata. This study begins to address the types of genomic changes associated with the earliest origins of simple sociality using the small carpenter bee. Genes associated with lipid transport and DNA recombination have undergone positive selection in C. calcarata relative to other bee lineages. Furthermore, we provide the first methylome of a noneusocial bee. Ceratina calcarata contains the complete enzymatic toolkit for DNA methylation. As in the honey bee and many other holometabolous insects, DNA methylation is targeted to exons. The addition of this genome allows for new lines of research into the genetic and epigenetic precursors to complex social behaviors.
Affiliation(s)
- Sandra M Rehan
  - Department of Biological Sciences, University of New Hampshire, Durham
- Sarah P Lawson
  - Department of Biological Sciences, University of New Hampshire, Durham
9. Williams TA, Nakjang S, Campbell SE, Freeman MA, Eydal M, Moore K, Hirt RP, Embley TM, Williams BAP. A Recent Whole-Genome Duplication Divides Populations of a Globally Distributed Microsporidian. Mol Biol Evol 2016; 33:2002-15. [PMID: 27189558 PMCID: PMC4948709 DOI: 10.1093/molbev/msw083] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 02/07/2023]
Abstract
The Microsporidia are a major group of intracellular fungi and important parasites of animals including insects, fish, and immunocompromised humans. Microsporidian genomes have undergone extreme reductive evolution but there are major differences in genome size and structure within the group: some are prokaryote-like in size and organisation (<3 Mb of gene-dense sequence) while others have more typically eukaryotic genome architectures. To gain fine-scale, population-level insight into the evolutionary dynamics of these tiny eukaryotic genomes, we performed the broadest microsporidian population genomic study to date, sequencing geographically isolated strains of Spraguea, a marine microsporidian infecting goosefish worldwide. Our analysis revealed that population structure across the Atlantic Ocean is associated with a conserved difference in ploidy, with American and Canadian isolates sharing an ancestral whole genome duplication that was followed by widespread pseudogenisation and sorting-out of paralogue pairs. While past analyses have suggested de novo gene formation of microsporidian-specific genes, we found evidence for the origin of new genes from noncoding sequence since the divergence of these populations. Some of these genes experience selective constraint, suggesting the evolution of new functions and local host adaptation. Combining our data with published microsporidian genomes, we show that nucleotide composition across the phylum is shaped by a mutational bias favoring A and T nucleotides, which is opposed by an evolutionary force favoring an increase in genomic GC content. This study reveals ongoing dramatic reorganization of genome structure and the evolution of new gene functions in modern microsporidians despite extensive genomic streamlining in their common ancestor.
Affiliation(s)
- Tom A Williams
  - Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- Sirintra Nakjang
  - Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- Scott E Campbell
  - Biosciences, College of Life and Environmental Sciences, University of Exeter, Devon, United Kingdom
- Mark A Freeman
  - Ross University School of Veterinary Medicine, St. Kitts, West Indies
- Matthías Eydal
  - Institute for Experimental Pathology, University of Iceland, Keldur, Iceland
- Karen Moore
  - Biosciences, College of Life and Environmental Sciences, University of Exeter, Devon, United Kingdom
- Robert P Hirt
  - Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- T Martin Embley
  - Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- Bryony A P Williams
  - Biosciences, College of Life and Environmental Sciences, University of Exeter, Devon, United Kingdom
10. Cong Q, Shen J, Warren AD, Borek D, Otwinowski Z, Grishin NV. Speciation in Cloudless Sulphurs Gleaned from Complete Genomes. Genome Biol Evol 2016; 8:915-31. [PMID: 26951782 PMCID: PMC4894063 DOI: 10.1093/gbe/evw045] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Indexed: 11/12/2022]
Abstract
For 200 years, zoologists have relied on phenotypes to learn about the evolution of animals. A glance at the genotype, even through several gene markers, revolutionized our understanding of animal phylogeny. Recent advances in sequencing techniques allow researchers to study speciation mechanisms and the link between genotype and phenotype using complete genomes. We sequenced and assembled a complete genome of the Cloudless Sulphur (Phoebis sennae) from a single wild-caught specimen. This genome was used as reference to compare genomes of six specimens, three from the eastern populations (Oklahoma and north Texas), referred to as a subspecies Phoebis sennae eubule, and three from the southwestern populations (south Texas) known as a subspecies Phoebis sennae marcellina. While the two subspecies differ only subtly in phenotype and mitochondrial DNA, comparison of their complete genomes revealed consistent and significant differences, which are more prominent than those between tiger swallowtails Pterourus canadensis and Pterourus glaucus. The two sulphur taxa differed in histone methylation regulators, chromatin-associated proteins, circadian clock, and early development proteins. Despite being well separated on the whole-genome level, the two taxa show introgression, with gene flow mainly from P. s. marcellina to P. s. eubule. Functional analysis of introgressed genes reveals enrichment in transmembrane transporters. Many transporters are responsible for nutrient uptake, and their introgression may be of selective advantage for caterpillars to feed on more diverse food resources. Phylogenetically, complete genomes place family Pieridae away from Papilionidae, which is consistent with previous analyses based on several gene markers.
Affiliation(s)
- Qian Cong
  - Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center
- Jinhui Shen
  - Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center
- Andrew D Warren
  - McGuire Center for Lepidoptera and Biodiversity, Florida Museum of Natural History, University of Florida
- Dominika Borek
  - Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center
- Zbyszek Otwinowski
  - Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center
- Nick V Grishin
  - Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center
11. Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 179] [Impact Index Per Article: 22.4] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022]
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and the data types for which these hold, providing guidance on which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
|
12
|
Safonova Y, Bonissone S, Kurpilyansky E, Starostina E, Lapidus A, Stinson J, DePalatis L, Sandoval W, Lill J, Pevzner PA. IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics 2015; 31:i53-61. [PMID: 26072509 PMCID: PMC4542777 DOI: 10.1093/bioinformatics/btv238] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Indexed: 12/18/2022]
Abstract
The analysis of concentrations of circulating antibodies in serum (antibody repertoire) is a fundamental, yet poorly studied, problem in immunoinformatics. The two current approaches to the analysis of antibody repertoires [next generation sequencing (NGS) and mass spectrometry (MS)] present difficult computational challenges since antibodies are not directly encoded in the germline but are extensively diversified by somatic recombination and hypermutations. Therefore, the protein database required for the interpretation of spectra from circulating antibodies is custom for each individual. Although such a database can be constructed via NGS, the reads generated by NGS are error-prone and even a single nucleotide error precludes identification of a peptide by the standard proteomics tools. Here, we present the IgRepertoireConstructor algorithm that performs error-correction of immunosequencing reads and uses mass spectra to validate the constructed antibody repertoires. Availability and implementation: IgRepertoireConstructor is open source and freely available as a C++ and Python program running on all Unix-compatible platforms. The source code is available from http://bioinf.spbau.ru/igtools. Contact: ppevzner@ucsd.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yana Safonova
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Stefano Bonissone
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Eugene Kurpilyansky
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Ekaterina Starostina
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Alla Lapidus
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Jeremy Stinson
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Laura DePalatis
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Wendy Sandoval
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Jennie Lill
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Pavel A Pevzner
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
|
13
|
Jansen G, Crummenerl LL, Gilbert F, Mohr T, Pfefferkorn R, Thänert R, Rosenstiel P, Schulenburg H. Evolutionary Transition from Pathogenicity to Commensalism: Global Regulator Mutations Mediate Fitness Gains through Virulence Attenuation. Mol Biol Evol 2015. [PMID: 26199376 PMCID: PMC4651237 DOI: 10.1093/molbev/msv160] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Indexed: 12/15/2022]
Abstract
Symbiotic interactions are indispensable for metazoan function, but their origin and evolution remain elusive. We use a controlled evolution experiment to demonstrate the emergence of novel commensal interactions between Pseudomonas aeruginosa, an initially pathogenic bacterium, and a metazoan host, Caenorhabditis elegans. We show that commensalism evolves through loss of virulence, because it provides bacteria with a double fitness advantage: Increased within-host fitness and a larger host population to infect. Commensalism arises irrespective of host immune status, as the adaptive path in immunocompromised C. elegans knockouts does not differ from that in wild type. Dissection of temporal dynamics of genomic adaptation for 125 bacterial populations reveals highly parallel evolution of incipient commensalism across independent biological replicates. Adaptation is mainly achieved through frame shift mutations in the global regulator lasR and nonsynonymous point mutations in the polymerase gene rpoB that arise early in evolution. Genetic knockouts of lasR not only corroborate its role in virulence attenuation but also show that further mutations are necessary for the fully commensal phenotype. The evolutionary transition from pathogenicity to commensalism as we observe here is facilitated by mutations in global regulators such as lasR, because few genetic changes cause pleiotropic effects across the genome with large phenotypic effects. Finally, we found that nucleotide diversity increased more quickly in bacteria adapting to immunocompromised hosts than in those adapting to immunocompetent hosts. Nevertheless, the outcome of evolution was comparable across host types. Commensalism can thus evolve independently of host immune state solely as a side-effect of bacterial adaptation to novel hosts.
Affiliation(s)
- Gunther Jansen
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
- Lena L Crummenerl
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
- Felix Gilbert
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
- Timm Mohr
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
- Roxana Pfefferkorn
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
- Robert Thänert
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
- Philip Rosenstiel
- Molecular Cell Biology, Institute for Clinical Molecular Biology, University of Kiel, Kiel, Germany
- Hinrich Schulenburg
- Evolutionary Ecology and Genetics, Zoological Institute, University of Kiel, Kiel, Germany
|
14
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022]
|
15
|
Kuan CS, Yew SM, Toh YF, Chan CL, Ngeow YF, Lee KW, Na SL, Yee WY, Hoh CC, Ng KP. Dissecting the fungal biology of Bipolaris papendorfii: from phylogenetic to comparative genomic analysis. DNA Res 2015; 22:219-32. [PMID: 25922537 PMCID: PMC4463846 DOI: 10.1093/dnares/dsv007] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 01/17/2015] [Accepted: 03/28/2015] [Indexed: 01/21/2023]
Abstract
Bipolaris papendorfii has been reported as a fungal plant pathogen that rarely causes opportunistic infection in humans. Secondary metabolites isolated from this fungus possess medicinal and anticancer properties. However, its genetic fundamental and basic biology are largely unknown. In this study, we report the first draft genome sequence of B. papendorfii UM 226 isolated from the skin scraping of a patient. The assembled 33.4 Mb genome encodes 11,015 putative coding DNA sequences, of which, 2.49% are predicted transposable elements. Multilocus phylogenetic and phylogenomic analyses showed B. papendorfii UM 226 clustering with Curvularia species, apart from other plant pathogenic Bipolaris species. Its genomic features suggest that it is a heterothallic fungus with a putative unique gene encoding the LysM-containing protein which might be involved in fungal virulence on host plants, as well as a wide array of enzymes involved in carbohydrate metabolism, degradation of polysaccharides and lignin in the plant cell wall, secondary metabolite biosynthesis (including dimethylallyl tryptophan synthase, non-ribosomal peptide synthetase, polyketide synthase), the terpenoid pathway and the caffeine metabolism. This first genomic characterization of B. papendorfii provides the basis for further studies on its biology, pathogenicity and medicinal potential.
Affiliation(s)
- Chee Sian Kuan
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
- Su Mei Yew
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
- Yue Fen Toh
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
- Chai Ling Chan
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
- Yun Fong Ngeow
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
- Kok Wei Lee
- Codon Genomics SB, Jalan Bandar Lapan Belas, Selangor Darul Ehsan 47160, Malaysia
- Shiang Ling Na
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
- Wai-Yan Yee
- Codon Genomics SB, Jalan Bandar Lapan Belas, Selangor Darul Ehsan 47160, Malaysia
- Chee-Choong Hoh
- Codon Genomics SB, Jalan Bandar Lapan Belas, Selangor Darul Ehsan 47160, Malaysia
- Kee Peng Ng
- Department of Medical Microbiology, Faculty of Medicine, University of Malaya, Kuala Lumpur 50603, Malaysia
|
16
|
Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors using K-mer tries. Bioinformatics 2015; 31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022]
|
17
|
|
18
|
Soares-Castro P, Santos PM. Deciphering the genome repertoire of Pseudomonas sp. M1 toward β-myrcene biotransformation. Genome Biol Evol 2014; 7:1-17. [PMID: 25503374 PMCID: PMC4316614 DOI: 10.1093/gbe/evu254] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/02/2022]
Abstract
Pseudomonas sp. M1 is able to mineralize several unusual substrates of natural and xenobiotic origin, contributing to its competence to thrive in different ecological niches. In this work, the genome of M1 strain was resequenced by Illumina MiSeq to refine the quality of a published draft by resolving the majority of repeat-rich regions. In silico genome analysis led to the prediction of metabolic pathways involved in biotransformation of several unusual substrates (e.g., plant-derived volatiles), providing clues on the genomic complement required for such biodegrading/biotransformation functionalities. Pseudomonas sp. M1 exhibits a particular sensory and biotransformation/biocatalysis potential toward β-myrcene, a terpene vastly used in industries worldwide. Therefore, the genomic responsiveness of M1 strain toward β-myrcene was investigated, using an RNA sequencing approach. M1 cells challenged with β-myrcene (compared with cells grown in lactate) undergo an extensive alteration of the transcriptome expression profile, including 1,873 genes evidencing at least 1.5-fold of altered expression (627 upregulated and 1,246 downregulated), toward β-myrcene-imposed molecular adaptation and cellular specialization. A thorough data analysis identified a novel 28-kb genomic island, whose expression was strongly stimulated in β-myrcene-supplemented medium, that is essential for β-myrcene catabolism. This island includes β-myrcene-induced genes whose products are putatively involved in 1) substrate sensing, 2) gene expression regulation, and 3) β-myrcene oxidation and bioconversion of β-myrcene derivatives into central metabolism intermediates. In general, this locus does not show high homology with sequences available in databases and seems to have evolved through the assembly of several functional blocks acquired from different bacteria, probably, at different evolutionary stages.
Affiliation(s)
- Pedro Soares-Castro
- CBMA-Centre of Molecular and Environmental Biology, Department of Biology, University of Minho, Campus de Gualtar, Braga, Portugal
- Pedro M Santos
- CBMA-Centre of Molecular and Environmental Biology, Department of Biology, University of Minho, Campus de Gualtar, Braga, Portugal
|
19
|
Emmert-Streib F, Tripathi S, Simoes RDM, Hawwa AF, Dehmer M. The human disease network. ACTA ACUST UNITED AC 2014. [DOI: 10.4161/sysb.22816] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 11/19/2022]
|
20
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
|
21
|
Carvalho AB, Clark AG. Efficient identification of Y chromosome sequences in the human and Drosophila genomes. Genome Res 2013; 23:1894-907. [PMID: 23921660 PMCID: PMC3814889 DOI: 10.1101/gr.156034.113] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.6] [Received: 02/05/2013] [Accepted: 07/25/2013] [Indexed: 12/25/2022]
Abstract
Notwithstanding their biological importance, Y chromosomes remain poorly known in most species. A major obstacle to their study is the identification of Y chromosome sequences; due to its high content of repetitive DNA, in most genome projects, the Y chromosome sequence is fragmented into a large number of small, unmapped scaffolds. Identification of Y-linked genes among these fragments has yielded important insights about the origin and evolution of Y chromosomes, but the process is labor intensive, restricting studies to a small number of species. Apart from these fragmentary assemblies, in a few mammalian species, the euchromatic sequence of the Y is essentially complete, owing to painstaking BAC mapping and sequencing. Here we use female short-read sequencing and k-mer comparison to identify Y-linked sequences in two very different genomes, Drosophila virilis and human. Using this method, essentially all D. virilis scaffolds were unambiguously classified as Y-linked or not Y-linked. We found 800 new scaffolds (totaling 8.5 Mbp), and four new genes in the Y chromosome of D. virilis, including JYalpha, a gene involved in hybrid male sterility. Our results also strongly support the preponderance of gene gains over gene losses in the evolution of the Drosophila Y. In the intensively studied human genome, used here as a positive control, we recovered all previously known genes or gene families, plus a small amount (283 kb) of new, unfinished sequence. Hence, this method works in large and complex genomes and can be applied to any species with sex chromosomes.
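The k-mer comparison idea in this abstract can be illustrated with a minimal sketch: a scaffold from a male assembly is flagged as a Y-linked candidate when most of its k-mers never occur in female reads. This is not the authors' pipeline. The function names, the k-mer size `k`, and the `threshold` value are hypothetical choices for illustration, and the sketch ignores reverse complements, sequencing errors, and repeats.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unmatched_fraction(scaffold, female_kmers, k):
    """Fraction of the scaffold's distinct k-mers absent from the female k-mer set."""
    scaffold_kmers = kmers(scaffold, k)
    if not scaffold_kmers:
        return 0.0
    missing = sum(1 for km in scaffold_kmers if km not in female_kmers)
    return missing / len(scaffold_kmers)


def classify_scaffolds(scaffolds, female_reads, k=15, threshold=0.8):
    """Map scaffold name -> True if it looks Y-linked (hypothetical threshold)."""
    female_kmers = set()
    for read in female_reads:
        female_kmers |= kmers(read, k)
    return {
        name: unmatched_fraction(seq, female_kmers, k) >= threshold
        for name, seq in scaffolds.items()
    }


# Toy usage with a tiny k so the example stays readable.
scaffolds = {"auto": "ACGTACGTAC", "y": "TTTTGGGGCC"}
female_reads = ["ACGTACGTACGT"]
calls = classify_scaffolds(scaffolds, female_reads, k=4)
```

In this toy input every k-mer of `auto` is covered by the female read, while none of the k-mers of `y` are, so only `y` is flagged.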
Affiliation(s)
- Antonio Bernardo Carvalho
- Departamento de Genética, Universidade Federal do Rio de Janeiro, Caixa Postal 68011, CEP 21941-971, Rio de Janeiro, Brazil
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Andrew G. Clark
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA
|
22
|
Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics 2013; 30:31-7. [DOI: 10.1093/bioinformatics/btt310] [Citation(s) in RCA: 481] [Impact Index Per Article: 43.7] [Indexed: 11/15/2022]
|
23
|
Zou Q, Li XB, Jiang WR, Lin ZY, Li GL, Chen K. Survey of MapReduce frame operation in bioinformatics. Brief Bioinform 2013; 15:637-47. [PMID: 23396756 DOI: 10.1093/bib/bbs088] [Citation(s) in RCA: 107] [Impact Index Per Article: 9.7] [Indexed: 11/14/2022]
Abstract
Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as the future works on parallel computing in bioinformatics.
|
24
|
Abstract
Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community.
|
25
|
Abstract
Motivation: Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model. Results: SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels, and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly. Availability: SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.
Affiliation(s)
- Roy Ronen
- Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
|
26
|
Linghu B, Franzosa EA, Xia Y. Construction of functional linkage gene networks by data integration. Methods Mol Biol 2013. [PMID: 23192549 DOI: 10.1007/978-1-62703-107-3_14] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/31/2022]
Abstract
Networks of functional associations between genes have recently been successfully used for gene function and disease-related research. A typical approach for constructing such functional linkage gene networks (FLNs) is based on the integration of diverse high-throughput functional genomics datasets. Data integration is a nontrivial task due to the heterogeneous nature of the different data sources and their variable accuracy and completeness. The presence of correlations between data sources also adds another layer of complexity to the integration process. In this chapter we discuss an approach for constructing a human FLN from data integration and a subsequent application of the FLN to novel disease gene discovery. Similar approaches can be applied to nonhuman species and other discovery tasks.
Affiliation(s)
- Bolan Linghu
- Translational Sciences Department, Novartis Institutes for BioMedical Research, Cambridge, MA, USA.
|
27
|
Liu B, Yuan J, Yiu SM, Li Z, Xie Y, Chen Y, Shi Y, Zhang H, Li Y, Lam TW, Luo R. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 2012; 28:2870-4. [PMID: 23044551 DOI: 10.1093/bioinformatics/bts563] [Citation(s) in RCA: 119] [Impact Index Per Article: 9.9] [Indexed: 12/17/2022]
Abstract
Motivation: The boost of next-generation sequencing technologies provides us with an unprecedented opportunity for elucidating genetic mysteries, yet the short-read length hinders us from better assembling the genome from scratch. New protocols now exist that can generate overlapping pair-end reads. By joining the 3' ends of each read pair, one is able to construct longer reads for assembling. However, effectively joining two overlapped pair-end reads remains a challenging task. Result: In this article, we present an efficient tool called Connecting Overlapped Pair-End (COPE) reads, to connect overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10 and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads. Availability and implementation: COPE is implemented in C++ and is freely available as open-source code at ftp://ftp.genomics.org.cn/pub/cope. Contact: twlam@cs.hku.hk or luoruibang@genomics.org.cn
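As a rough illustration of what "joining the 3' ends of each read pair" means, the sketch below merges a read pair by plain suffix-prefix overlap: the second read is reverse-complemented, and the longest overlap whose mismatch rate stays under a cutoff is accepted. This is a deliberately simplified stand-in, not COPE's method; COPE additionally uses k-mer frequencies to choose among candidate overlaps. The function names and the `min_overlap` and `max_mismatch_rate` defaults are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=4, max_mismatch_rate=0.1):
    """Join a read pair on the longest acceptable overlap, or return None.

    read2 is assumed to come from the opposite strand, so it is
    reverse-complemented before the suffix of read1 is compared
    with the prefix of the mate.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        suffix = read1[-overlap:]
        prefix = mate[:overlap]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches / overlap <= max_mismatch_rate:
            return read1 + mate[overlap:]
    return None


# Toy usage: two 12-base reads from a 20-base fragment overlap by 4 bases.
fragment = "ACGTTGCAAGGCTTACCGAT"
read1 = fragment[:12]
read2 = reverse_complement(fragment[8:])
merged = merge_pair(read1, read2)
```

Trying the longest overlap first is one possible design choice; in repetitive sequence it can pick a spurious overlap, which is the kind of ambiguity the abstract says k-mer frequencies are used to resolve.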
Affiliation(s)
- Binghang Liu
- HKU-BGI BAL-Bioinformatics Algorithms and Core Technology Research Laboratory, The University of Hong Kong, Hong Kong
|
28
|
Hu TT, Eisen MB, Thornton KR, Andolfatto P. A second-generation assembly of the Drosophila simulans genome provides new insights into patterns of lineage-specific divergence. Genome Res 2012; 23:89-98. [PMID: 22936249 PMCID: PMC3530686 DOI: 10.1101/gr.141689.112] [Citation(s) in RCA: 119] [Impact Index Per Article: 9.9] [Indexed: 01/29/2023]
Abstract
We create a new assembly of the Drosophila simulans genome using 142 million paired short-read sequences and previously published data for strain w501. Our assembly represents a higher-quality genomic sequence with greater coverage, fewer misassemblies, and, by several indexes, fewer sequence errors. Evolutionary analysis of this genome reference sequence reveals interesting patterns of lineage-specific divergence that are different from those previously reported. Specifically, we find that Drosophila melanogaster evolves faster than D. simulans at all annotated classes of sites, including putatively neutrally evolving sites found in minimal introns. While this may be partly explained by a higher mutation rate in D. melanogaster, we also find significant heterogeneity in rates of evolution across classes of sites, consistent with historical differences in the effective population size for the two species. Also contrary to previous findings, we find that the X chromosome is evolving significantly faster than autosomes for nonsynonymous and most noncoding DNA sites and significantly slower for synonymous sites. The absence of a X/A difference for putatively neutral sites and the robustness of the pattern to Gene Ontology and sex-biased expression suggest that partly recessive beneficial mutations may comprise a substantial fraction of noncoding DNA divergence observed between species. Our results have more general implications for the interpretation of evolutionary analyses of genomes of different quality.
Affiliation(s)
- Tina T Hu
- Department of Ecology and Evolutionary Biology and the Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA.
|
29
|
Huang S, Chen Z, Huang G, Yu T, Yang P, Li J, Fu Y, Yuan S, Chen S, Xu A. HaploMerger: reconstructing allelic relationships for polymorphic diploid genome assemblies. Genome Res 2012; 22:1581-8. [PMID: 22555592 PMCID: PMC3409271 DOI: 10.1101/gr.133652.111] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.2] [Received: 10/19/2011] [Accepted: 05/02/2012] [Indexed: 11/24/2022]
Abstract
Whole-genome shotgun assembly has been a long-standing issue for highly polymorphic genomes, and the advent of next-generation sequencing technologies has made the issue more challenging than ever. Here we present an automated pipeline, HaploMerger, for reconstructing allelic relationships in a diploid assembly. HaploMerger combines a LASTZ-ChainNet alignment approach with a novel graph-based structure, which helps to untangle allelic relationships between two haplotypes and guides the subsequent creation of reference haploid assemblies. The pipeline provides flexible parameters and schemes to improve the contiguity, continuity, and completeness of the reference assemblies. We show that HaploMerger produces efficient and accurate results in simulations and has advantages over manual curation when applied to real polymorphic assemblies (e.g., 4%-5% heterozygosity). We also used HaploMerger to analyze the diploid assembly of a single Chinese amphioxus (Branchiostoma belcheri) and compared the resulting haploid assemblies with EST sequences, which revealed that the two haplotypes are not only divergent but also highly complementary to each other. Taken together, we have demonstrated that HaploMerger is an effective tool for analyzing and exploiting polymorphic genome assemblies.
Affiliation(s)
- Shengfeng Huang
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Zelin Chen
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Guangrui Huang
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Ting Yu
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Ping Yang
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Jie Li
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Yonggui Fu
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Shaochun Yuan
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Shangwu Chen
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
- Anlong Xu
- State Key Laboratory of Biocontrol, Guangdong Key Laboratory of Pharmaceutical Functional Genes, College of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, People's Republic of China
|
30
|
|
31
|
Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012; 28:1420-8. [PMID: 22495754 DOI: 10.1093/bioinformatics/bts174] [Citation(s) in RCA: 1990] [Impact Index Per Article: 165.8] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depths of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs. RESULTS: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to high-confident contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets shows that IDBA-UD can reconstruct longer contigs with higher accuracy. AVAILABILITY: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud
Affiliation(s)
- Yu Peng
- Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
|
32
|
Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform 2012; 14:56-66. [DOI: 10.1093/bib/bbs015] [Citation(s) in RCA: 177] [Impact Index Per Article: 14.8] [Indexed: 11/12/2022] Open
|
33
|
Tenaillon O, Rodríguez-Verdugo A, Gaut RL, McDonald P, Bennett AF, Long AD, Gaut BS. The molecular diversity of adaptive convergence. Science 2012; 335:457-61. [PMID: 22282810 DOI: 10.1126/science.1212986] [Citation(s) in RCA: 504] [Impact Index Per Article: 42.0] [Indexed: 11/02/2022]
Abstract
To estimate the number and diversity of beneficial mutations, we experimentally evolved 115 populations of Escherichia coli to 42.2°C for 2000 generations and sequenced one genome from each population. We identified 1331 total mutations, affecting more than 600 different sites. Few mutations were shared among replicates, but a strong pattern of convergence emerged at the level of genes, operons, and functional complexes. Our experiment uncovered a set of primary functional targets of high temperature, but we estimate that many other beneficial mutations could contribute to similar adaptive outcomes. We inferred the pervasive presence of epistasis among beneficial mutations, which shaped adaptive trajectories into at least two distinct pathways involving mutations either in the RNA polymerase complex or the termination factor rho.
Affiliation(s)
- Olivier Tenaillon
- Department of Ecology and Evolutionary Biology, University of California-Irvine, CA 92697, USA.
|
34
|
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 2012; 22:557-67. [PMID: 22147368 DOI: 10.1101/gr.131383.111] [Citation(s) in RCA: 412] [Impact Index Per Article: 34.3] [Indexed: 12/31/2022]
Abstract
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
Affiliation(s)
- Steven L Salzberg
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
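The GAGE abstract above refers to "statistics on contiguity" without naming one. A common example is N50: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The sketch below is a minimal illustration of that definition only; the function name `n50` is hypothetical and this is not the GAGE evaluation code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Illustrative lengths only, not data from the paper.
    print(n50([80, 70, 50, 40, 30, 20]))
```

As the abstract notes, a high contiguity value of this kind says nothing by itself about whether the contigs are correct, so it is usually reported alongside error counts.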
|
35
|
Schatz MC, Phillippy AM, Sommer DD, Delcher AL, Puiu D, Narzisi G, Salzberg SL, Pop M. Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform 2011; 14:213-24. [PMID: 22199379 DOI: 10.1093/bib/bbr074] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 11/13/2022] Open
Abstract
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.
|
36
|
Li Z, Chen Y, Mu D, Yuan J, Shi Y, Zhang H, Gan J, Li N, Hu X, Liu B, Yang B, Fan W. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics 2011; 11:25-37. [DOI: 10.1093/bfgp/elr035] [Citation(s) in RCA: 146] [Impact Index Per Article: 11.2] [Indexed: 11/12/2022] Open
|
37
|
Abstract
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
|
38
|
Medvedev P, Scott E, Kakaradov B, Pevzner P. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 2011; 27:i137-41. [PMID: 21685062 PMCID: PMC3117386 DOI: 10.1093/bioinformatics/btr208] [Citation(s) in RCA: 85] [Impact Index Per Article: 6.5] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new applications. As a result, error correction of sequencing reads remains an important problem. Though several tools do an excellent job of correcting datasets where the reads are sampled close to uniformly, the problem of correcting reads coming from drastically non-uniform datasets, such as those from single-cell sequencing, remains open. RESULTS: In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of a Hamming graph and a simple probabilistic model for sequencing errors. It is a simple and adaptable algorithm that improves on other tools on non-uniform single-cell data, while achieving comparable results on normal multi-cell data. AVAILABILITY: http://www.cs.toronto.edu/~pashadag. CONTACT: pmedvedev@cs.ucsd.edu.
Affiliation(s)
- Paul Medvedev
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA.
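The Hammer abstract above names a Hamming graph as one of its two ingredients. As a rough illustration of that idea only, the sketch below links k-mers that differ in at most `tau` positions, groups them into connected components, and maps every k-mer to the most frequent member of its component. Choosing the most frequent k-mer is a simplification assumed here in place of the probabilistic error model the abstract mentions; all function names and the `tau` parameter are hypothetical, and the all-pairs comparison is quadratic, so this is suitable for toy inputs only.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_components(kmers, tau=1):
    """Connected components of the graph joining k-mers within distance tau."""
    parent = {km: km for km in kmers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(kmers, 2):
        if hamming(a, b) <= tau:
            parent[find(a)] = find(b)

    groups = {}
    for km in kmers:
        groups.setdefault(find(km), []).append(km)
    return list(groups.values())


def correction_map(counts, tau=1):
    """Map each k-mer to the most abundant k-mer in its component."""
    mapping = {}
    for component in hamming_components(list(counts), tau):
        # Ties on count are broken lexicographically for determinism.
        representative = max(component, key=lambda km: (counts[km], km))
        for km in component:
            mapping[km] = representative
    return mapping


if __name__ == "__main__":
    reads = ["ACGTA"] * 5 + ["ACGTT"] + ["GGGGG"] * 3
    print(correction_map(kmer_counts(reads, k=5)))
```

Because components are defined by sequence similarity rather than by a global count cutoff, a sketch like this makes no assumption that coverage is uniform, which is the setting the abstract targets.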
|
39
|
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011; 21:2224-41. [PMID: 21926179 DOI: 10.1101/gr.126599.111] [Citation(s) in RCA: 318] [Impact Index Per Article: 24.5] [Indexed: 12/21/2022]
Abstract
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
Affiliation(s)
- Dent Earl
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
|
40
|
Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 2011; 27:2957-63. [PMID: 21903629 DOI: 10.1093/bioinformatics/btr507] [Citation(s) in RCA: 8382] [Impact Index Per Article: 644.8] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome. RESULTS: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds. AVAILABILITY AND IMPLEMENTATION: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash. CONTACT: t.magoc@gmail.com.
Affiliation(s)
- Tanja Magoč
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
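The FLASH abstract above describes merging read pairs whose fragment is shorter than twice the read length. The following is a minimal sketch of that overlap-and-merge idea, not the FLASH implementation (which is written in C and also uses base qualities). The function names, the `min_overlap` and `max_mismatch_ratio` parameters, and their values are assumptions made for illustration; in the overlapping region this sketch simply keeps the bases of the first read.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a DNA string over the alphabet A, C, G, T, N."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a paired-end read pair into one longer sequence, or return None.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented before searching for an overlap between the
    end of read1 and the start of read2.
    """
    r2 = reverse_complement(read2)
    best = None  # (mismatch_ratio, overlap_length)
    # Try every overlap length, longest first; keep the lowest mismatch ratio.
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olen:], r2[:olen]))
        ratio = mismatches / olen
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, olen)
    if best is None:
        return None  # no acceptable overlap: leave the pair unmerged
    # Keep read1 in full and append the part of read2 beyond the overlap.
    return read1 + r2[best[1]:]


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"  # made-up 30 bp fragment
    r1 = fragment[:20]
    r2 = reverse_complement(fragment[10:])
    print(merge_pair(r1, r2) == fragment)
```

Scoring candidate overlaps by mismatch ratio rather than requiring an exact match lets a sketch like this tolerate some sequencing errors, at the cost of occasional false merges in repetitive sequence.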
|
41
|
Salmela L, Schroder J. Correcting errors in short reads by multiple alignments. Bioinformatics 2011; 27:1455-61. [DOI: 10.1093/bioinformatics/btr170] [Citation(s) in RCA: 123] [Impact Index Per Article: 9.5] [Indexed: 11/13/2022] Open
|