Reference Citation Analysis

For: Medvedev P, Brudno M. Maximum likelihood genome assembly. J Comput Biol 2009;16:1101-16. [PMID: 19645596 DOI: 10.1089/cmb.2009.0047] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022] Open

Cited by Other Article(s)

Medvedev P. Theoretical Analysis of Sequencing Bioinformatics Algorithms and Beyond. Communications of the ACM 2023;66:118-125. [PMID: 38736702 PMCID: PMC11087067 DOI: 10.1145/3571723] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/14/2024]

Rahman A, Pachter L. SWALO: scaffolding with assembly likelihood optimization. Nucleic Acids Res 2021;49:e117. [PMID: 34417615 PMCID: PMC8599790 DOI: 10.1093/nar/gkab717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/22/2020] [Revised: 06/16/2021] [Accepted: 08/16/2021] [Indexed: 01/01/2023] Open

Ayling M, Clark MD, Leggett RM. New approaches for metagenome assembly with short reads. Brief Bioinform 2021;21:584-594. [PMID: 30815668 PMCID: PMC7299287 DOI: 10.1093/bib/bbz020] [Citation(s) in RCA: 100] [Impact Index Per Article: 33.3] [Received: 11/12/2018] [Revised: 01/31/2019] [Accepted: 02/01/2019] [Indexed: 02/07/2023] Open

Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021;22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open

Steyaert A, Audenaert P, Fostier J. Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields. BMC Bioinformatics 2020;21:402. [PMID: 32928110 PMCID: PMC7491180 DOI: 10.1186/s12859-020-03740-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2020] [Accepted: 09/04/2020] [Indexed: 12/01/2022] Open

Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, Genereux D, Johnson J, Marinescu VD, Alföldi J, Harris RS, Lindblad-Toh K, Haussler D, Karlsson E, Jarvis ED, Zhang G, Paten B. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 2020;587:246-251. [PMID: 33177663 PMCID: PMC7673649 DOI: 10.1038/s41586-020-2871-y] [Citation(s) in RCA: 182] [Impact Index Per Article: 45.5] [Received: 08/24/2019] [Accepted: 07/27/2020] [Indexed: 12/11/2022]

Affiliation(s)

Joel Armstrong UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
Glenn Hickey UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
Mark Diekhans UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
Ian T Fiddes UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
Adam M Novak UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
Alden Deran UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
Qi Fang BGI-Shenzhen, Beishan Industrial Zone, Shenzhen, China; Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Duo Xie BGI-Shenzhen, Beishan Industrial Zone, Shenzhen, China; BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, China
Shaohong Feng BGI-Shenzhen, Beishan Industrial Zone, Shenzhen, China; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
Josefin Stiller Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Diane Genereux Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
Jeremy Johnson Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
Voichita Dana Marinescu Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
Jessica Alföldi Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
Robert S Harris Department of Biology, The Pennsylvania State University, University Park, PA, USA
Kerstin Lindblad-Toh Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA, USA; Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
David Haussler UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA
Elinor Karlsson Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA, USA; Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA; Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
Erich D Jarvis Howard Hughes Medical Institute, Chevy Chase, MD, USA; Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
Guojie Zhang Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China; China National GeneBank, BGI-Shenzhen, Shenzhen, China
Benedict Paten UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA

Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253. [PMID: 32972461 PMCID: PMC7513500 DOI: 10.1186/s13059-020-02157-2] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 10/18/2019] [Accepted: 08/26/2020] [Indexed: 02/07/2023] Open

Beyer W, Novak AM, Hickey G, Chan J, Tan V, Paten B, Zerbino DR. Sequence tube maps: making graph genomes intuitive to commuters. Bioinformatics 2020;35:5318-5320. [PMID: 31368484 PMCID: PMC6954646 DOI: 10.1093/bioinformatics/btz597] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 03/25/2019] [Revised: 06/27/2019] [Accepted: 07/26/2019] [Indexed: 12/19/2022] Open

Yanes L, Garcia Accinelli G, Wright J, Ward BJ, Clavijo BJ. A Sequence Distance Graph framework for genome assembly and analysis. F1000Res 2019;8:1490. [PMID: 31723420 PMCID: PMC6833988 DOI: 10.12688/f1000research.20233.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Accepted: 08/12/2019] [Indexed: 11/20/2022] Open

Salmela L, Tomescu AI. Safely Filling Gaps with Partial Solutions Common to All Solutions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019;16:617-626. [PMID: 29994355 DOI: 10.1109/tcbb.2017.2785831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]

Haussler D, Smuga-Otto M, Eizenga JM, Paten B, Novak AM, Nikitin S, Zueva M, Miagkov D. A Flow Procedure for Linearization of Genome Sequence Graphs. J Comput Biol 2018;25:664-676. [PMID: 29792514 PMCID: PMC6067104 DOI: 10.1089/cmb.2017.0248] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open

Paten B, Eizenga JM, Rosen YM, Novak AM, Garrison E, Hickey G. Superbubbles, Ultrabubbles, and Cacti. J Comput Biol 2018;25:649-663. [PMID: 29461862 DOI: 10.1089/cmb.2017.0251] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 11/12/2022] Open

Obscura Acosta N, Mäkinen V, Tomescu AI. A safe and complete algorithm for metagenomic assembly. Algorithms Mol Biol 2018;13:3. [PMID: 29445416 PMCID: PMC5802251 DOI: 10.1186/s13015-018-0122-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/10/2017] [Accepted: 01/20/2018] [Indexed: 11/10/2022] Open

Abstract

Background

Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G.

Approach

We address this problem with the “safe and complete” framework of Tomescu and Medvedev (Research in computational Molecular biology—20th annual conference, RECOMB 9649:152–163, 2016). An algorithm is called safe if it returns only those walks (also called safe) that appear as subwalk in all metagenomic assembly solutions for G. A safe algorithm is called complete if it returns all safe walks of G.

Results

We give graph-theoretic characterizations of the safe walks of G, and a safe and complete algorithm finding all safe walks of G. In the node-covering case, our algorithm runs in time $O(m^2 + n^3)$, and in the edge-covering case it runs in time $O(m^2 n)$; n and m denote the number of nodes and edges, respectively, of G. This algorithm constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.

Shomorony I, Kim SH, Courtade TA, Tse DNC. Information-optimal genome assembly via sparse read-overlap graphs. Bioinformatics 2017;32:i494-i502. [PMID: 27587667 DOI: 10.1093/bioinformatics/btw450] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 11/14/2022] Open

Novak AM, Garrison E, Paten B. A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol 2017;12:18. [PMID: 28702075 PMCID: PMC5505026 DOI: 10.1186/s13015-017-0109-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 12/20/2016] [Accepted: 06/17/2017] [Indexed: 01/23/2023] Open

Paten B, Novak AM, Eizenga JM, Garrison E. Genome graphs and the evolution of genome inference. Genome Res 2017;27:665-676. [PMID: 28360232 PMCID: PMC5411762 DOI: 10.1101/gr.214155.116] [Citation(s) in RCA: 164] [Impact Index Per Article: 23.4] [Indexed: 11/25/2022]

Dzamba M, Ramani AK, Buczkowicz P, Jiang Y, Yu M, Hawkins C, Brudno M. Identification of complex genomic rearrangements in cancers using CouGaR. Genome Res 2016;27:107-117. [PMID: 27986820 PMCID: PMC5204335 DOI: 10.1101/gr.211201.116] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 06/09/2016] [Accepted: 11/10/2016] [Indexed: 12/17/2022]

Tomescu AI, Medvedev P. Safe and Complete Contig Assembly Through Omnitigs. J Comput Biol 2016;24:590-602. [PMID: 27749096 DOI: 10.1089/cmb.2016.0141] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022] Open

Gao S, Bertrand D, Chia BKH, Nagarajan N. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biol 2016;17:102. [PMID: 27169502 PMCID: PMC4864936 DOI: 10.1186/s13059-016-0951-y] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 10/27/2015] [Accepted: 04/13/2016] [Indexed: 11/10/2022] Open

A Graph Extension of the Positional Burrows-Wheeler Transform and Its Applications. Lecture Notes in Computer Science 2016. [DOI: 10.1007/978-3-319-43681-4_20] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/12/2022]

Boža V, Brejová B, Vinař T. GAML: genome assembly by maximum likelihood. Algorithms Mol Biol 2015;10:18. [PMID: 26042154 PMCID: PMC4454275 DOI: 10.1186/s13015-015-0052-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/08/2015] [Accepted: 05/07/2015] [Indexed: 11/10/2022] Open

Simpson JT, Pop M. The Theory and Practice of Genome Sequence Assembly. Annu Rev Genomics Hum Genet 2015;16:153-72. [PMID: 25939056 DOI: 10.1146/annurev-genom-090314-050032] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Indexed: 11/09/2022]

Yasuda T, Miyano S. Inferring the global structure of chromosomes from structural variations. BMC Genomics 2015;16 Suppl 2:S13. [PMID: 25707904 PMCID: PMC4331713 DOI: 10.1186/1471-2164-16-s2-s13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open

Nguyen N, Hickey G, Zerbino DR, Raney B, Earl D, Armstrong J, Kent WJ, Haussler D, Paten B. Building a pan-genome reference for a population. J Comput Biol 2015;22:387-401. [PMID: 25565268 DOI: 10.1089/cmb.2014.0146] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Indexed: 01/05/2023] Open

The complex task of choosing a de novo assembly: Lessons from fungal genomes. Comput Biol Chem 2014;53 Pt A:97-107. [DOI: 10.1016/j.compbiolchem.2014.08.014] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Accepted: 07/11/2014] [Indexed: 12/21/2022]

Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics 2014;15:302. [PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 04/08/2014] [Accepted: 08/01/2014] [Indexed: 11/10/2022] Open

Lindsay J, Salooti H, Măndoiu I, Zelikovsky A. ILP-based maximum likelihood genome scaffolding. BMC Bioinformatics 2014;15 Suppl 9:S9. [PMID: 25253180 PMCID: PMC4168704 DOI: 10.1186/1471-2105-15-s9-s9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023] Open

Bayesian genome assembly and assessment by Markov chain Monte Carlo sampling. PLoS One 2014;9:e99497. [PMID: 24968249 PMCID: PMC4072599 DOI: 10.1371/journal.pone.0099497] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/04/2014] [Accepted: 05/15/2014] [Indexed: 11/26/2022] Open

Paten B, Zerbino DR, Hickey G, Haussler D. A unifying model of genome evolution under parsimony. BMC Bioinformatics 2014;15:206. [PMID: 24946830 PMCID: PMC4082375 DOI: 10.1186/1471-2105-15-206] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 12/12/2013] [Accepted: 05/08/2014] [Indexed: 11/23/2022] Open

Bernard E, Jacob L, Mairal J, Vert JP. Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics 2014;30:2447-55. [PMID: 24813214 PMCID: PMC4147886 DOI: 10.1093/bioinformatics/btu317] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Indexed: 11/22/2022] Open

Affiliation(s)

Elsa Bernard Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
Laurent Jacob Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
Julien Mairal Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
Jean-Philippe Vert Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France

Koren S, Treangen TJ, Hill CM, Pop M, Phillippy AM. Automated ensemble assembly and validation of microbial genomes. BMC Bioinformatics 2014;15:126. [PMID: 24884846 PMCID: PMC4030574 DOI: 10.1186/1471-2105-15-126] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 02/07/2014] [Accepted: 04/24/2014] [Indexed: 11/12/2022] Open

Abstract

Background

The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.

Results

To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.

Conclusions

Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.

Miga KH, Newton Y, Jain M, Altemose N, Willard HF, Kent WJ. Centromere reference models for human chromosomes X and Y satellite arrays. Genome Res 2014;24:697-707. [PMID: 24501022 PMCID: PMC3975068 DOI: 10.1101/gr.159624.113] [Citation(s) in RCA: 156] [Impact Index Per Article: 15.6] [Indexed: 01/03/2023]

El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. Next Generation Sequencing Technologies and Challenges in Sequence Assembly 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/12/2023]

Orenstein Y, Shamir R. Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers. Bioinformatics 2013;29:i71-9. [PMID: 23813011 PMCID: PMC3694677 DOI: 10.1093/bioinformatics/btt230] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/16/2022] Open

El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013;9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Indexed: 01/19/2023] Open

Howison M, Zapata F, Dunn CW. Toward a statistically explicit understanding of de novo sequence assembly. Bioinformatics 2013;29:2959-63. [PMID: 24021385 DOI: 10.1093/bioinformatics/btt525] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open

Ghodsi M, Hill CM, Astrovskaya I, Lin H, Sommer DD, Koren S, Pop M. De novo likelihood-based measures for comparing genome assemblies. BMC Res Notes 2013;6:334. [PMID: 23965294 PMCID: PMC3765854 DOI: 10.1186/1756-0500-6-334] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 05/13/2013] [Accepted: 08/13/2013] [Indexed: 12/12/2022] Open

Abstract

Background

The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These “gold standards” can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics.
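
As a rough illustration of the definition above (quality as the conditional probability of the reads given the assembly), the following is a minimal sketch under strong simplifying assumptions that are not taken from the article: reads are error-free, they are sampled uniformly from one strand of the assembly, and a read that does not occur in the assembly is given a small floor probability. The function names and the `floor` parameter are hypothetical.

```python
import math
from typing import Sequence


def count_occurrences(assembly: str, read: str) -> int:
    """Count exact occurrences of read in assembly, allowing overlaps."""
    count = 0
    start = assembly.find(read)
    while start != -1:
        count += 1
        start = assembly.find(read, start + 1)
    return count


def log_likelihood(assembly: str, reads: Sequence[str], floor: float = 1e-9) -> float:
    """Sum of log P(read | assembly) under a toy uniform, error-free model.

    P(read | assembly) is the fraction of start positions in the assembly
    at which the read matches exactly; `floor` (an assumption of this
    sketch) replaces zero probabilities so the logarithm stays finite.
    """
    total = 0.0
    for read in reads:
        positions = len(assembly) - len(read) + 1
        if positions <= 0:
            p = floor
        else:
            p = max(count_occurrences(assembly, read) / positions, floor)
        total += math.log(p)
    return total


if __name__ == "__main__":
    reads = ["ACGT", "CGTA", "GTAC"]
    # The assembly that explains every read scores higher than one that does not.
    print(log_likelihood("ACGTAC", reads) > log_likelihood("ACGTTT", reads))
```

Under this toy model, an assembly that contains every read scores higher than one that is missing some of them, which is the kind of comparison between assemblies of the same read set that the abstract describes. A realistic model would also need to account for sequencing errors, both strands, and paired reads.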

Results

We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly “bake-offs” with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled.

Conclusion

Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation.

Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics 2013;14 Suppl 5:S18. [PMID: 23902516 PMCID: PMC3706340 DOI: 10.1186/1471-2105-14-s5-s18] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 02/06/2023] Open

Kapun E, Tsarev F. De Bruijn Superwalk with Multiplicities Problem is NP-hard. BMC Bioinformatics 2013;14 Suppl 5:S7. [PMID: 23734822 PMCID: PMC3622630 DOI: 10.1186/1471-2105-14-s5-s7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open

Rahman A, Pachter L. CGAL: computing genome assembly likelihoods. Genome Biol 2013;14:R8. [PMID: 23360652 PMCID: PMC3663106 DOI: 10.1186/gb-2013-14-1-r8] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Received: 08/23/2012] [Accepted: 01/29/2013] [Indexed: 01/12/2023] Open

Kapun E, Tsarev F. On NP-Hardness of the Paired de Bruijn Sound Cycle Problem. Lecture Notes in Computer Science 2013. [DOI: 10.1007/978-3-642-40453-5_6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/21/2023]

Nijkamp JF, van den Broek MA, Geertman JMA, Reinders MJT, Daran JMG, de Ridder D. De novo detection of copy number variation by co-assembly. Bioinformatics 2012;28:3195-202. [DOI: 10.1093/bioinformatics/bts601] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.1] [Indexed: 11/13/2022] Open

Bashir A, Klammer A, Robins WP, Chin CS, Webster D, Paxinos E, Hsu D, Ashby M, Wang S, Peluso P, Sebra R, Sorenson J, Bullard J, Yen J, Valdovino M, Mollova E, Luong K, Lin S, LaMay B, Joshi A, Rowe L, Frace M, Tarr CL, Turnsek M, Davis BM, Kasarskis A, Mekalanos JJ, Waldor MK, Schadt EE. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 2012;30:701-707. [PMID: 22750883 DOI: 10.1038/nbt.2288] [Citation(s) in RCA: 158] [Impact Index Per Article: 13.2] [Received: 04/17/2012] [Accepted: 05/30/2012] [Indexed: 02/08/2023]

Affiliation(s)

Ali Bashir Pacific Biosciences, Menlo Park, CA
Aaron Klammer Pacific Biosciences, Menlo Park, CA
William P Robins Department of Medicine, Harvard Medical School, Boston, MA
Chen-Shan Chin Pacific Biosciences, Menlo Park, CA
Dale Webster Pacific Biosciences, Menlo Park, CA
Ellen Paxinos Pacific Biosciences, Menlo Park, CA
David Hsu Pacific Biosciences, Menlo Park, CA
Meredith Ashby Pacific Biosciences, Menlo Park, CA
Susana Wang Pacific Biosciences, Menlo Park, CA
Paul Peluso Pacific Biosciences, Menlo Park, CA
Robert Sebra Pacific Biosciences, Menlo Park, CA
Jon Sorenson Pacific Biosciences, Menlo Park, CA
James Bullard Pacific Biosciences, Menlo Park, CA
Jackie Yen Pacific Biosciences, Menlo Park, CA
Marie Valdovino Pacific Biosciences, Menlo Park, CA
Emilia Mollova Pacific Biosciences, Menlo Park, CA
Khai Luong Pacific Biosciences, Menlo Park, CA
Steven Lin Pacific Biosciences, Menlo Park, CA
Brianna LaMay Pacific Biosciences, Menlo Park, CA
Amruta Joshi Pacific Biosciences, Menlo Park, CA
Lori Rowe National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
Michael Frace National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
Cheryl L Tarr National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
Maryann Turnsek National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
Brigid M Davis Channing Laboratory, Brigham and Women's Hospital, Boston, MA; Department of Medicine, Harvard Medical School, Boston, MA; Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA; Howard Hughes Medical Institute, Boston, MA
Andrew Kasarskis Pacific Biosciences, Menlo Park, CA
John J Mekalanos Department of Medicine, Harvard Medical School, Boston, MA
Matthew K Waldor Channing Laboratory, Brigham and Women's Hospital, Boston, MA; Department of Medicine, Harvard Medical School, Boston, MA; Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA; Howard Hughes Medical Institute, Boston, MA
Eric E Schadt Pacific Biosciences, Menlo Park, CA; Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York City


Sahli M, Shibuya T. Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes. BMC Res Notes 2012;5:243. [PMID: 22591859 PMCID: PMC3441218 DOI: 10.1186/1756-0500-5-243] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2011] [Accepted: 05/16/2012] [Indexed: 11/12/2022] Open

Abstract

Background

Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe that creating specific assemblers, for solving specific cases, will be much more fruitful than creating general assemblers.

Findings

In this paper, we present Arapan-S, a whole-genome assembly program dedicated to handling small genomes. It provides only one contig (along with the reverse complement of this contig) in many cases. Although genomes consist of a number of segments, the implemented algorithm can detect all the segments, as we demonstrate for Influenza Virus A. The Arapan-S program is based on the de Bruijn graph. We have implemented a very sophisticated and fast method to reconstruct the original sequence and neglect erroneous k-mers. The method explores the graph by using neither the shortest nor the longest path, but rather a specific and reliable path based on the coverage level or k-mers’ lengths. Arapan-S uses short reads, and it was tested on raw data downloaded from the NCBI Trace Archive.

Conclusions

Our findings show that the accuracy of the assembly was very high; the result was checked against the European Bioinformatics Institute (EBI) database using the NCBI BLAST Sequence Similarity Search. The identity and the genome coverage were more than 99%. We also compared the efficiency of Arapan-S with other well-known assemblers. In dealing with small genomes, the accuracy of Arapan-S is significantly higher than the accuracy of other assemblers. The assembly process is very fast and requires only a few seconds.

Arapan-S is available for free to the public. The binary files for Arapan-S are available through http://sourceforge.net/projects/dnascissor/files/.
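
The abstract above describes walking a de Bruijn graph along a path chosen by coverage while neglecting erroneous k-mers. The following is a minimal illustrative sketch of that general idea, not the Arapan-S implementation: the function names `build_de_bruijn` and `greedy_walk`, the `min_coverage` threshold, and the toy reads are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Count every k-mer in the reads; each k-mer is an edge prefix -> suffix."""
    coverage = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            coverage[read[i:i + k]] += 1
    graph = defaultdict(list)
    for kmer, count in coverage.items():
        graph[kmer[:-1]].append((kmer[1:], count))
    return graph, coverage


def greedy_walk(graph, start, min_coverage=2):
    """Extend a contig from `start`, always taking the best-covered unused edge.

    Edges seen fewer than `min_coverage` times are treated as likely
    sequencing errors and skipped (a hypothetical threshold for this sketch).
    """
    contig = start
    node = start
    used = set()
    while True:
        candidates = [
            (count, nxt)
            for nxt, count in graph.get(node, [])
            if count >= min_coverage and (node, nxt) not in used
        ]
        if not candidates:
            return contig
        _, nxt = max(candidates)
        used.add((node, nxt))
        contig += nxt[-1]
        node = nxt


# Two error-free copies of a toy genome plus one read with a base-call error.
reads = ["ATGGCGTGCA", "ATGGCGTGCA", "ATGGCT"]
graph, coverage = build_de_bruijn(reads, k=4)
print(greedy_walk(graph, "ATG"))  # ATGGCGTGCA
```

In this toy input the erroneous k-mer GGCT appears only once, so the walk ignores it and recovers the original sequence as a single contig. Real assemblers must additionally handle reverse complements, repeats, and uneven coverage, none of which this sketch attempts.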


Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics 2012;13 Suppl 6:S10. [PMID: 22537039 PMCID: PMC3358655 DOI: 10.1186/1471-2105-13-s6-s10] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023] Open

Abstract

Background

A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants - including duplications, deletions, inversions, translocations, and other rearrangements - result in a cancer genome that is a scrambling of intervals, or "blocks" of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from paired-end DNA sequencing data.

Results

By aligning paired reads from a cancer genome - and a matched germline genome, if available - to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called Paired-end Reconstruction of Genome Organization (PREGO), to solve this problem by reducing it to an optimization problem on an interval-adjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome that is consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. non-reciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles.

Conclusions

We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at http://compbio.cs.brown.edu/software/.
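
The abstract above reduces reconstruction to finding an alternating Eulerian tour in an interval-adjacency graph. As a rough sketch of one ingredient of that idea, the code below checks the degree-balance condition such a tour requires: at every interval extremity, the number of interval-edge copies must equal the number of adjacency-edge copies. This is not the PREGO optimization itself; the function name `is_balanced`, the data layout, and the assumption of a circular genome (so that no telomere ends need special treatment) are all choices made for this example, and connectivity of the graph is not checked.

```python
from collections import Counter


def is_balanced(copy_number, adjacencies):
    """Check the balance condition for an alternating Eulerian tour.

    copy_number: {interval_name: copies}
    adjacencies: {((interval, side), (interval, side)): multiplicity},
                 where side is "start" or "end".
    """
    interval_degree = Counter()
    adjacency_degree = Counter()
    # Each copy of an interval uses both of its extremities once.
    for interval, copies in copy_number.items():
        interval_degree[(interval, "start")] += copies
        interval_degree[(interval, "end")] += copies
    # Each adjacency copy uses the two extremities it joins.
    for (u, v), multiplicity in adjacencies.items():
        adjacency_degree[u] += multiplicity
        adjacency_degree[v] += multiplicity
    vertices = set(interval_degree) | set(adjacency_degree)
    return all(interval_degree[x] == adjacency_degree[x] for x in vertices)


# Toy circular genome A B B C: a tandem duplication of interval B.
adjacencies = {
    (("A", "end"), ("B", "start")): 1,
    (("B", "end"), ("B", "start")): 1,  # the duplication junction
    (("B", "end"), ("C", "start")): 1,
    (("C", "end"), ("A", "start")): 1,
}
print(is_balanced({"A": 1, "B": 2, "C": 1}, adjacencies))  # True
print(is_balanced({"A": 1, "B": 1, "C": 1}, adjacencies))  # False
```

With measured data the copy numbers and adjacency multiplicities are noisy estimates, which is why the abstract frames the task as an optimization problem rather than a direct check like this one.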


Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011;21:2224-41. [PMID: 21926179 DOI: 10.1101/gr.126599.111] [Citation(s) in RCA: 318] [Impact Index Per Article: 24.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]

Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011. [PMID: 21926179 DOI: 10.1101/gr.126599] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]

Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple sequence alignment. Genome Res 2011;21:1512-28. [PMID: 21665927 DOI: 10.1101/gr.123356.111] [Citation(s) in RCA: 157] [Impact Index Per Article: 12.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]

Wetzel J, Kingsford C, Pop M. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinformatics 2011;12:95. [PMID: 21486487 PMCID: PMC3103447 DOI: 10.1186/1471-2105-12-95] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2010] [Accepted: 04/13/2011] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature.

RESULTS

We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read-length) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper-bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off-the-shelf assembler in the presence of base-call errors.

CONCLUSIONS

Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.


Miller CA, Buckley KM, Easley RL, Smith LC. An Sp185/333 gene cluster from the purple sea urchin and putative microsatellite-mediated gene diversification. BMC Genomics 2010;11:575. [PMID: 20955585 PMCID: PMC3091723 DOI: 10.1186/1471-2164-11-575] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2010] [Accepted: 10/18/2010] [Indexed: 11/19/2022] Open

Abstract

Background

The immune system of the purple sea urchin, Strongylocentrotus purpuratus, is complex and sophisticated. An important component of sea urchin immunity is the Sp185/333 gene family, which is significantly upregulated in immunologically challenged animals. The Sp185/333 genes are less than 2 kb with two exons and are members of a large diverse family composed of greater than 40 genes. The S. purpuratus genome assembly, however, contains only six Sp185/333 genes. This underrepresentation could be due to the difficulties that large gene families present in shotgun assembly, where multiple similar genes can be collapsed into a single consensus gene.

Results

To understand the genomic organization of the Sp185/333 gene family, a BAC insert containing Sp185/333 genes was assembled, with careful attention to avoiding artifacts resulting from collapse or artificial duplication/expansion of very similar genes. Twelve candidate BAC assemblies were generated with varying parameters and the optimal assembly was identified by PCR, restriction digests, and subclone sequencing. The validated assembly contained six Sp185/333 genes that were clustered in a 34 kb region at one end of the BAC with five of the six genes tightly clustered within 20 kb. The Sp185/333 genes in this cluster were no more similar to each other than to previously sequenced Sp185/333 genes isolated from three different animals. This was unexpected given their proximity and putative effects of gene homogenization in closely linked, similar genes. All six genes displayed significant similarity including both 5' and 3' flanking regions, which were bounded by microsatellites. Three of the Sp185/333 genes and their flanking regions were tandemly duplicated such that each repeated segment consisted of a gene plus 0.7 kb 5' and 2.4 kb 3' of the gene (4.5 kb total). Both edges of the segmental duplications were bounded by different microsatellites.

Conclusions

The high sequence similarity of the Sp185/333 genes and flanking regions suggests that the microsatellites may promote genomic instability and are involved with gene duplication and/or gene conversion and the extraordinary sequence diversity of this family.
