Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Choi JH, Kim S, Tang H, Andrews J, Gilbert DG, Colbourne JK. A machine-learning approach to combined evidence validation of genome assemblies. ACTA ACUST UNITED AC 2008;24:744-50. [PMID: 18204064 DOI: 10.1093/bioinformatics/btm608] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]

For:	Choi JH, Kim S, Tang H, Andrews J, Gilbert DG, Colbourne JK. A machine-learning approach to combined evidence validation of genome assemblies. ACTA ACUST UNITED AC 2008;24:744-50. [PMID: 18204064 DOI: 10.1093/bioinformatics/btm608] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]

Number

Cited by Other Article(s)

Padovani de Souza K, Setubal JC, Ponce de Leon F de Carvalho AC, Oliveira G, Chateau A, Alves R. Machine learning meets genome assembly. Brief Bioinform 2020;20:2116-2129. [PMID: 30137230 DOI: 10.1093/bib/bby072] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2018] [Revised: 07/11/2018] [Accepted: 07/22/2018] [Indexed: 12/23/2022] Open

Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol 2014;10:e1003998. [PMID: 25474019 PMCID: PMC4256071 DOI: 10.1371/journal.pcbi.1003998] [Citation(s) in RCA: 172] [Impact Index Per Article: 17.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2014] [Accepted: 10/22/2014] [Indexed: 11/19/2022] Open

Abstract

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

The initial publication of the genome sequence of many plants, animals, and microbes is often accompanied with great fanfare. However, these genomes are almost always first-drafts, with a lot of missing data, many gaps, and many errors in the published sequences. Compounding this problem, the genes identified in draft genome sequences are also affected by incomplete genome assemblies: the number and exact structure of predicted genes may be incorrect. Here we quantify the extent of such errors, by comparing several draft genomes against completed versions of the same sequences. Surprisingly, we find huge numbers of errors in the number of genes predicted from draft assemblies, with more than half of all genes having the wrong number of copies in the draft genomes examined. Our investigation also reveals the major causes of these errors, and further analyses using additional functional data demonstrate that many of the gene predictions can be corrected. The results presented here suggest that many inferences based on published draft genomes may be erroneous, but offer a way forward for future analyses.

Collapse

Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. TRENDS IN PLANT SCIENCE 2014;19:798-808. [PMID: 25223304 DOI: 10.1016/j.tplants.2014.08.004] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Revised: 07/30/2014] [Accepted: 08/20/2014] [Indexed: 05/19/2023]

Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet 2013;14:157-67. [PMID: 23358380 DOI: 10.1038/nrg3367] [Citation(s) in RCA: 250] [Impact Index Per Article: 22.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. ACTA ACUST UNITED AC 2013;29:435-43. [PMID: 23303509 DOI: 10.1093/bioinformatics/bts723] [Citation(s) in RCA: 109] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]

Abstract

MOTIVATION

Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.

RESULTS

In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

AVAILABILITY

ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python.

Collapse

Lee H, Tang H. Next-generation sequencing technologies and fragment assembly algorithms. Methods Mol Biol 2012;855:155-174. [PMID: 22407708 DOI: 10.1007/978-1-61779-582-4_5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]

Lai AG, Denton-Giles M, Mueller-Roeber B, Schippers JHM, Dijkwel PP. Positional information resolves structural variations and uncovers an evolutionarily divergent genetic locus in accessions of Arabidopsis thaliana. Genome Biol Evol 2011;3:627-40. [PMID: 21622917 PMCID: PMC3157834 DOI: 10.1093/gbe/evr038] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open

Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res 2010;20:675-84. [PMID: 20305016 DOI: 10.1101/gr.096966.109] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]

Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol 2010;11:R28. [PMID: 20219098 PMCID: PMC2864568 DOI: 10.1186/gb-2010-11-3-r28] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2009] [Revised: 12/11/2009] [Accepted: 03/10/2010] [Indexed: 11/23/2022] Open

Pop M. Genome assembly reborn: recent computational challenges. Brief Bioinform 2009;10:354-66. [PMID: 19482960 DOI: 10.1093/bib/bbp026] [Citation(s) in RCA: 181] [Impact Index Per Article: 12.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open