Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res 2007;17:1697-706. [PMID: 17908823 PMCID: PMC2045152 DOI: 10.1101/gr.6435207] [Citation(s) in RCA: 207] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]



Cited by Other Article(s)

Rakkammal K, Priya A, Pandian S, Maharajan T, Rathinapriya P, Satish L, Ceasar SA, Sohn SI, Ramesh M. Conventional and Omics Approaches for Understanding the Abiotic Stress Response in Cereal Crops-An Updated Overview. PLANTS (BASEL, SWITZERLAND) 2022;11:plants11212852. [PMID: 36365305 PMCID: PMC9655223 DOI: 10.3390/plants11212852] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Revised: 10/19/2022] [Accepted: 10/22/2022] [Indexed: 05/22/2023]

Dida F, Yi G. Empirical evaluation of methods for de novo genome assembly. PeerJ Comput Sci 2021;7:e636. [PMID: 34307867 PMCID: PMC8279138 DOI: 10.7717/peerj-cs.636] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2021] [Accepted: 06/19/2021] [Indexed: 06/12/2023]

Yao W, Li Y, Xie W, Wang L. Features of sRNA biogenesis in rice revealed by genetic dissection of sRNA expression level. Comput Struct Biotechnol J 2020;18:3207-3216. [PMID: 33209208 PMCID: PMC7649420 DOI: 10.1016/j.csbj.2020.10.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2020] [Revised: 09/24/2020] [Accepted: 10/11/2020] [Indexed: 01/25/2023] Open

Li F, Zhao X, Li M, He K, Huang C, Zhou Y, Li Z, Walters JR. Insect genomes: progress and challenges. INSECT MOLECULAR BIOLOGY 2019;28:739-758. [PMID: 31120160 DOI: 10.1111/imb.12599] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/02/2017] [Revised: 03/22/2019] [Accepted: 05/14/2019] [Indexed: 05/24/2023]

GAAP: A Genome Assembly + Annotation Pipeline. BIOMED RESEARCH INTERNATIONAL 2019;2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 05/20/2019] [Accepted: 05/26/2019] [Indexed: 12/24/2022]

Fu S, Chang PL, Friesen ML, Teakle NL, Tarone AM, Sze SH. Identifying similar transcripts in a related organism from de Bruijn graphs of RNA-Seq data, with applications to the study of salt and waterlogging tolerance in Melilotus. BMC Genomics 2019;20:425. [PMID: 31167652 PMCID: PMC6551239 DOI: 10.1186/s12864-019-5702-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Kwon D, Lee J, Kim J. GMASS: a novel measure for genome assembly structural similarity. BMC Bioinformatics 2019;20:147. [PMID: 30885117 PMCID: PMC6423833 DOI: 10.1186/s12859-019-2710-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2018] [Accepted: 03/03/2019] [Indexed: 01/10/2023] Open

Yoon S, Kim D, Kang K, Park WJ. TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix. BMC Genomics 2018;19:653. [PMID: 30180798 PMCID: PMC6123912 DOI: 10.1186/s12864-018-5034-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2017] [Accepted: 08/23/2018] [Indexed: 01/15/2023] Open

Abstract

BACKGROUND

The challenges when developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers utilize a de Bruijn graph, with which complexity grows linearly with data size while suffering from errors and repeats. Although one can correct the errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. Two research directions are to improve either the graph reliability or the path search precision, and in this study, we focused on the former.

RESULTS

We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. In the proposed approach, we built contigs by direct read alignment within a distance margin and performed a junction search to construct splicing graphs. While doing so, a contig of length l was represented by a 4 × l matrix (called a consensus matrix), in which each element was the base count of the aligned reads so far. A representative sequence was obtained by taking the majority in each column of the consensus matrix to be used for further read alignment. Once the splicing graphs had been obtained, we used IsoLasso to find paths with a noticeable read depth. The experiments using real and simulated reads show that the method provided considerable improvement in sensitivity and moderately better performance when comparing sensitivity and precision. This was achieved by the error-aware graph construction using the consensus matrix, with which the reads having errors were made usable for the graph construction (otherwise, they might have been eventually discarded). This improved the quality of the coverage depth information used in the subsequent path search step and finally the reliability of the graph.

CONCLUSIONS

De novo assembly is mainly used to explore undiscovered isoforms and must be able to represent as many reads as possible in an efficient way. In this sense, TraRECo provides us with a potential alternative for improving graph reliability even though the computational burden is much higher than the single k-mer in the de Bruijn graph approach.
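The consensus matrix described in the abstract above can be illustrated with a minimal sketch. This is not TraRECo's code: the function names (`new_consensus`, `add_read`, `representative`), the fixed alignment offsets, the tie-breaking rule, and the use of `N` for uncovered columns are all assumptions made for illustration. Only the 4 × l count matrix and the per-column majority vote come from the abstract.

```python
BASES = "ACGT"  # row order of the 4 x l consensus matrix (an assumed convention)

def new_consensus(length):
    """Create a 4 x length matrix of zero base counts."""
    return [[0] * length for _ in BASES]

def add_read(matrix, read, offset):
    """Add one aligned read: count each base in the column it covers.

    `offset` is the contig column where the read starts (assumed to be
    known from a prior alignment step, which is not shown here).
    """
    for pos, base in enumerate(read):
        col = offset + pos
        row = BASES.find(base)
        if row >= 0 and 0 <= col < len(matrix[0]):
            matrix[row][col] += 1

def representative(matrix):
    """Take the majority base in each column; 'N' where nothing aligned."""
    seq = []
    for col in range(len(matrix[0])):
        counts = [matrix[row][col] for row in range(4)]
        best = max(counts)
        # Ties resolve to the first base in BASES (an arbitrary choice).
        seq.append(BASES[counts.index(best)] if best > 0 else "N")
    return "".join(seq)

matrix = new_consensus(8)
add_read(matrix, "ACGTAC", 0)
add_read(matrix, "ACGAAC", 0)  # carries a T->A error at column 3
add_read(matrix, "GTACGT", 2)
print(representative(matrix))  # ACGTACGT
```

In this toy input the read with the error still contributes its correct bases to the counts, and the error is outvoted in column 3, which is the effect the abstract attributes to error-aware graph construction.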


Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2018;16:361-378. [PMID: 28453648 DOI: 10.1093/bfgp/elx006] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open

Si X, Wang Q, Zhang L, Wu R, Ma J. Survey of gene splicing algorithms based on reads. Bioengineered 2017;8:750-758. [PMID: 28873323 DOI: 10.1080/21655979.2017.1373538] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022] Open

Li M, Wu B, Yan X, Luo J, Pan Y, Wu FX, Wang J. PECC: Correcting contigs based on paired-end read distribution. Comput Biol Chem 2017;69:178-184. [DOI: 10.1016/j.compbiolchem.2017.03.012] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2017] [Accepted: 03/27/2017] [Indexed: 11/26/2022]

Li M, Liao Z, He Y, Wang J, Luo J, Pan Y. ISEA: Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017;14:916-925. [PMID: 27076460 DOI: 10.1109/tcbb.2016.2550433] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]

Sze SH, Pimsler ML, Tomberlin JK, Jones CD, Tarone AM. A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms. BMC Genomics 2017;18:387. [PMID: 28589866 PMCID: PMC5461550 DOI: 10.1186/s12864-017-3735-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open

Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016;34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]

De novo assembly of transcriptome from next-generation sequencing data. QUANTITATIVE BIOLOGY 2016. [DOI: 10.1007/s40484-016-0069-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]

The A, C, G, and T of Genome Assembly. BIOMED RESEARCH INTERNATIONAL 2016;2016:6329217. [PMID: 27247941 PMCID: PMC4877455 DOI: 10.1155/2016/6329217] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 12/22/2015] [Indexed: 11/18/2022]

Tripathi R, Sharma P, Chakraborty P, Varadwaj PK. Next-generation sequencing revolution through big data analytics. FRONTIERS IN LIFE SCIENCE 2016. [DOI: 10.1080/21553769.2016.1178180] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]

Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine. Pharmaceutics 2016;8:E15. [PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2015] [Revised: 03/11/2016] [Accepted: 04/06/2016] [Indexed: 01/15/2023] Open

Abstract

Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, till now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. 
Finally, we discuss the use of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy, and detecting tumors. We believe that precision medicine would benefit greatly from bioinformatics solutions, particularly for personal genome assembly.


Al-Okaily AA. HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads. BMC Genomics 2016;17:193. [PMID: 26945881 PMCID: PMC4779561 DOI: 10.1186/s12864-016-2515-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2015] [Accepted: 02/23/2016] [Indexed: 11/15/2022] Open

Langenkämper D, Jakobi T, Feld D, Jelonek L, Goesmann A, Nattkemper TW. Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations. Front Genet 2016;7:5. [PMID: 26904094 PMCID: PMC4748744 DOI: 10.3389/fgene.2016.00005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2015] [Accepted: 01/17/2016] [Indexed: 12/27/2022] Open

Abstract

In recent years, the clock rates of modern processors have stagnated while the demand for computing power has continued to grow. This applies particularly to the life sciences and bioinformatics, where new technologies keep producing raw data at an increasing rate. The number of cores per processor has increased in an attempt to compensate for the only slight gains in clock rate. This technological shift demands changes in software development, especially in high-performance computing, where parallelization techniques are gaining importance because of the large datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications that employ them in different areas of sequence informatics. Furthermore, we provide examples of automatic acceleration for two use cases to show the typical problems and gains of transforming a serial application into a parallel one. The paper should help the reader choose a technique for the problem at hand. We compare four state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC) and discuss their performance and their applicability to the selected use cases. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. However, their performance is superior only at certain problem sizes because of data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for the CPU are mature and usually require no additional manual adjustment. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.


Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]

Khiste N, Ilie L. LASER: Large genome ASsembly EvaluatoR. BMC Res Notes 2015;8:709. [PMID: 26601933 PMCID: PMC4657217 DOI: 10.1186/s13104-015-1682-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2015] [Accepted: 11/09/2015] [Indexed: 11/10/2022] Open

Zhang F, Liao X, Peng S, Cui Y, Wang B, Zhu X, Liu J. A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer. Interdiscip Sci 2015;8:169-176. [PMID: 26403255 DOI: 10.1007/s12539-015-0127-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Revised: 07/23/2014] [Accepted: 09/03/2014] [Indexed: 11/26/2022]

Survey of Programs Used to Detect Alternative Splicing Isoforms from Deep Sequencing Data In Silico. BIOMED RESEARCH INTERNATIONAL 2015;2015:831352. [PMID: 26421304 PMCID: PMC4573434 DOI: 10.1155/2015/831352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/26/2014] [Revised: 02/17/2015] [Accepted: 03/02/2015] [Indexed: 11/29/2022]

Nimmy SF, Kamal MS. Next generation sequencing under de novo genome assembly. INT J BIOMATH 2015. [DOI: 10.1142/s1793524515300018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]

Ghosh TS, Mehra V, Mande SS. Grid-Assembly: An oligonucleotide composition-based partitioning strategy to aid metagenomic sequence assembly. J Bioinform Comput Biol 2015;13:1541004. [PMID: 25790784 DOI: 10.1142/s0219720015410048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]

Sim M, Kim J. Metagenome assembly through clustering of next-generation sequencing data using protein sequences. J Microbiol Methods 2015;109:180-7. [PMID: 25572018 DOI: 10.1016/j.mimet.2015.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2014] [Revised: 01/03/2015] [Accepted: 01/03/2015] [Indexed: 11/16/2022]

Gai LP, Liu H, Cui JH, Ji N, Ding XD, Sun C, Yu LS. Distributions of allele combination in single and cross loci among patients with several kinds of chronic diseases and the normal population. Genomics 2015;105:168-74. [PMID: 25561352 DOI: 10.1016/j.ygeno.2014.12.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2014] [Revised: 12/05/2014] [Accepted: 12/27/2014] [Indexed: 11/27/2022]

Bratcher HB, Corton C, Jolley KA, Parkhill J, Maiden MCJ. A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes. BMC Genomics 2014;15:1138. [PMID: 25523208 PMCID: PMC4377854 DOI: 10.1186/1471-2164-15-1138] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2014] [Accepted: 12/04/2014] [Indexed: 12/25/2022] Open

Abstract

BACKGROUND

Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publically available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary.

RESULTS

The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses and the data are publically available in the PubMLST Neisseria database.

CONCLUSIONS

The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease causing lineages.
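The core-genome criterion mentioned in the results above (loci present in at least 95% of the population) can be sketched as a simple presence count. This is an illustrative sketch only, not the Genome Comparator tool: the `core_loci` function, the input layout (isolate name mapped to a set of locus identifiers), and the locus names are all hypothetical.

```python
from collections import Counter

def core_loci(isolate_loci, threshold=0.95):
    """Return loci present in at least `threshold` of the isolates.

    `isolate_loci` maps an isolate name to the set of locus identifiers
    annotated in that isolate's assembly (a hypothetical input layout).
    """
    n_isolates = len(isolate_loci)
    if n_isolates == 0:
        return []
    # Count each locus once per isolate, even if it was listed twice.
    counts = Counter(locus for loci in isolate_loci.values() for locus in set(loci))
    return sorted(locus for locus, c in counts.items() if c / n_isolates >= threshold)

# Toy population of 20 isolates with made-up locus names.
isolates = {f"isolate_{i}": {"locus_a"} for i in range(20)}
for i in range(19):
    isolates[f"isolate_{i}"].add("locus_b")  # present in 19 of 20 isolates
for i in range(10):
    isolates[f"isolate_{i}"].add("locus_c")  # present in 10 of 20 isolates

print(core_loci(isolates))  # ['locus_a', 'locus_b']
```

A presence threshold below 100% tolerates loci that are missing from a few draft assemblies, for example because they fall in a gap between contigs.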


Zhu X, Leung HCM, Chin FYL, Yiu SM, Quan G, Liu B, Wang Y. PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach. PLoS One 2014;9:e114253. [PMID: 25461763 PMCID: PMC4252104 DOI: 10.1371/journal.pone.0114253] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2014] [Accepted: 11/05/2014] [Indexed: 12/31/2022] Open

Abstract

Because the reads produced by high-throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to favor the more confident long-overlap reads during assembly and mix them with reads that have relatively short overlaps. Moreover, these approaches have not been specially designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and read overlap sizes ranging from O_max to O_min to resolve gaps and branches. By constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead approach. Many hard-to-resolve branches are due to tandem repeats that lie close together in the genome. PERGA detects such different copies of the repeats to resolve the branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets, ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers.
PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
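The idea of trying overlap sizes from O_max down to O_min can be sketched as a toy greedy extension. This is not PERGA's implementation (see the repository above for that): it omits the machine-learning decision model, paired-end information, and the look-ahead step, and the function names and the exact-match overlap rule are assumptions made for illustration.

```python
def extend_once(contig, reads, o_max, o_min):
    """Try to append one base, preferring the longest overlaps.

    For each overlap size from o_max down to o_min (o_min >= 1), collect
    the next base proposed by every read whose prefix exactly matches the
    contig suffix of that size. Extend only if the proposals agree.
    """
    for overlap in range(min(o_max, len(contig)), o_min - 1, -1):
        suffix = contig[-overlap:]
        next_bases = {
            read[overlap]
            for read in reads
            if len(read) > overlap and read.startswith(suffix)
        }
        if len(next_bases) == 1:
            return contig + next_bases.pop()
    return contig  # no unambiguous extension at any allowed overlap

def extend(contig, reads, o_max, o_min, max_steps=1000):
    """Repeat single-base extension until it stalls or max_steps is hit."""
    for _ in range(max_steps):
        longer = extend_once(contig, reads, o_max, o_min)
        if longer == contig:
            break
        contig = longer
    return contig

reads = ["GTACGG", "TACGGA", "ACGTTT"]
print(extend("AGTAC", reads, o_max=4, o_min=3))  # AGTACGG
```

In the second step of this example the contig ends in `TACG`. The 4-base overlap selects only the read `TACGGA`, while the 3-base overlap `ACG` would also admit `ACGTTT`, so trying the longer overlap first avoids a branch that a single short threshold would create.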


Luo J, Wang J, Zhang Z, Wu FX, Li M, Pan Y. EPGA: de novo assembly using the distributions of reads and insert size. Bioinformatics 2014;31:825-33. [PMID: 25406329 DOI: 10.1093/bioinformatics/btu762] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]

Affiliation(s)

Junwei Luo School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
Jianxin Wang School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
Zhen Zhang School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
Fang-Xiang Wu School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
Min Li School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
Yi Pan School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA


Sequence assembly using next generation sequencing data--challenges and solutions. SCIENCE CHINA-LIFE SCIENCES 2014;57:1140-8. [PMID: 25326069 DOI: 10.1007/s11427-014-4752-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/29/2014] [Accepted: 08/25/2014] [Indexed: 10/24/2022]

Bragg L, Tyson GW. Metagenomics using next-generation sequencing. Methods Mol Biol 2014;1096:183-201. [PMID: 24515370 DOI: 10.1007/978-1-62703-712-9_15] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]

Bao E, Jiang T, Girke T. AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics 2014;30:i319-i328. [PMID: 24932000 PMCID: PMC4058956 DOI: 10.1093/bioinformatics/btu291] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open

Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics 2014;15:302. [PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2014] [Accepted: 08/01/2014] [Indexed: 11/10/2022] Open

Sze SH, Tarone AM. A memory-efficient algorithm to obtain splicing graphs and de novo expression estimates from de Bruijn graphs of RNA-Seq data. BMC Genomics 2014;15 Suppl 5:S6. [PMID: 25082000 PMCID: PMC4120145 DOI: 10.1186/1471-2164-15-s5-s6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Genomics and Proteomics of Foodborne Microorganisms. Food Microbiol 2014. [DOI: 10.1128/9781555818463.ch39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]

Reddy RM, Mohammed MH, Mande SS. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets. Genomics 2014;103:161-8. [DOI: 10.1016/j.ygeno.2014.02.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2013] [Revised: 02/18/2014] [Accepted: 02/24/2014] [Indexed: 10/25/2022]

El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]

El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013;9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open

Ramos RTJ, Carneiro AR, Caracciolo PH, Azevedo V, Schneider MPC, Barh D, Silva A. Graphical contig analyzer for all sequencing platforms (G4ALL): a new stand-alone tool for finishing and draft generation of bacterial genomes. Bioinformation 2013;9:599-604. [PMID: 23888102 PMCID: PMC3717189 DOI: 10.6026/97320630009599] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2013] [Accepted: 05/27/2013] [Indexed: 11/23/2022] Open

Chu HT, Hsiao WWL, Tsao TTH, Hsu DF, Chen CC, Lee SA, Kao CY. SeqEntropy: genome-wide assessment of repeats for short read sequencing. PLoS One 2013;8:e59484. [PMID: 23544073 PMCID: PMC3609794 DOI: 10.1371/journal.pone.0059484] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2012] [Accepted: 02/14/2013] [Indexed: 11/18/2022] Open

Abstract

BACKGROUND

Recent studies on genome assembly from short-read sequencing data have reported that this technology is limited in its ability to reconstruct an entire genome, even at very high coverage depth. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.

METHODOLOGY/PRINCIPAL FINDINGS

We define a metric H(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH(k) to measure the impact of repeats on the reconstruction of a whole genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de novo assembly coverage of a genome, and a score of ΔH(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths to achieve ΔH(k)<1% differ among organisms and are independent of genome size. For example, in order to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for the sequencing of the human genome (3.2×10^9 bp) and 320 bp for the sequencing of the fruit fly genome (1.8×10^8 bp). We also calculated the ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and that the entropy of sequencing reads provides a measurement of the repeat structures.

CONCLUSIONS/SIGNIFICANCE

The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation of the limitations of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful for tuning the sequencing parameters to achieve better genome assemblies when a closely related genome is already available.

Fu N, Wang Q, Shen HL. De novo assembly, gene annotation and marker development using Illumina paired-end transcriptome sequences in celery (Apium graveolens L.). PLoS One 2013;8:e57686. [PMID: 23469050 PMCID: PMC3585167 DOI: 10.1371/journal.pone.0057686] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2012] [Accepted: 01/23/2013] [Indexed: 12/11/2022] Open

Abstract

Background

Celery is an increasingly popular vegetable species, but limited transcriptome and genomic data hinder research on it. In addition, a lack of celery molecular markers limits progress in molecular genetic breeding. High-throughput transcriptome sequencing is an efficient method to generate a large transcriptome sequence dataset for gene discovery, molecular marker development and marker-assisted selection breeding.

Principal Findings

Celery transcriptomes from four tissues were sequenced using Illumina paired-end sequencing technology. De novo assembly was performed to generate a collection of 42,280 unigenes (average length of 502.6 bp) that represent the first transcriptome of the species. 78.43% and 48.93% of the unigenes had significant similarity with proteins in the National Center for Biotechnology Information (NCBI) non-redundant protein database (Nr) and the Swiss-Prot database, respectively, and 10,473 (24.77%) unigenes were assigned to Clusters of Orthologous Groups (COG). 21,126 (49.97%) unigenes harboring Interpro domains were annotated, of which 15,409 (36.45%) were assigned to Gene Ontology (GO) categories. Additionally, 7,478 unigenes were mapped onto 228 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). Large numbers of simple sequence repeats (SSRs) were identified, and the rates of successful amplification and polymorphism were then investigated among 31 celery accessions.

Conclusions

This study demonstrates the feasibility of generating large-scale sequence information by Illumina paired-end sequencing and efficient assembly. Our results provide a valuable resource for celery research. The developed molecular markers are the foundation of further genetic linkage analysis and gene localization, and they will be essential to accelerate the process of breeding.

Rahman A, Pachter L. CGAL: computing genome assembly likelihoods. Genome Biol 2013;14:R8. [PMID: 23360652 PMCID: PMC3663106 DOI: 10.1186/gb-2013-14-1-r8] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2012] [Accepted: 01/29/2013] [Indexed: 01/12/2023] Open

Luo C, Rodriguez-R LM, Konstantinidis KT. A user's guide to quantitative and comparative analysis of metagenomic datasets. Methods Enzymol 2013;531:525-47. [PMID: 24060135 DOI: 10.1016/b978-0-12-407863-5.00023-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]

Lu M, Luo Q, Wang B, Wu J, Zhao J. GPU-Accelerated Bidirected De Bruijn Graph Construction for Genome Assembly. WEB TECHNOLOGIES AND APPLICATIONS 2013. [DOI: 10.1007/978-3-642-37401-2_8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]

Wang Y, Yu Y, Pan B, Hao P, Li Y, Shao Z, Xu X, Li X. Optimizing hybrid assembly of next-generation sequence data from Enterococcus faecium: a microbe with highly divergent genome. BMC SYSTEMS BIOLOGY 2012;6 Suppl 3:S21. [PMID: 23282199 PMCID: PMC3524012 DOI: 10.1186/1752-0509-6-s3-s21] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]

Abstract

Background

Sequencing of bacterial genomes has become an essential approach to study pathogen virulence and the phylogenetic relationships among closely related strains. The bacterium Enterococcus faecium has emerged as an important nosocomial pathogen that is often associated with resistance to common antibiotics in hospitals. With its highly divergent gene content, it presents a challenge to next generation sequencing (NGS) technologies, which feature high throughput and shorter read lengths. This study was designed to investigate the properties and systematic biases of NGS technologies and evaluate critical parameters influencing the outcomes of hybrid assemblies using combinations of NGS data.

Results

A hospital strain of E. faecium was sequenced using three different NGS platforms: 454 GS-FLX, Illumina GAIIx, and ABI SOLiD4.0, to approximately 28-, 500-, and 400-fold coverage depth. We built a pipeline that merged contigs from each NGS dataset into hybrid assemblies. The results revealed that each single NGS assembly had a ceiling in continuity that could not be overcome by simply increasing data coverage depth. Each NGS technology displayed some intrinsic properties, e.g. base calling error and systematic bias. The gaps and low coverage regions of each NGS assembly were associated with lower GC content. In order to optimize the hybrid assembly approach, we tested varying amounts and different combinations of NGS data, and obtained optimal conditions for assembly continuity. We also showed, for the first time, that SOLiD data could help make much improved assemblies of the E. faecium genome using the hybrid approach when combined with other types of NGS data.

Conclusions

The current study addressed the difficult issue of how to most effectively construct a complete microbial genome using today's state of the art sequencing technologies. We characterized the sequence data and genome assembly from each NGS technology, tested conditions for hybrid assembly with combinations of NGS data, and obtained optimized parameters for achieving the most cost-efficient assembly. Our study helped form some guidelines to direct genomic work on other microorganisms, and thus has important practical implications.

Elucidation of bacterial genome complexity using next-generation sequencing. BIOTECHNOL BIOPROC E 2012. [DOI: 10.1007/s12257-012-0374-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]

Geng C, Chen Y, Wu K, Cai Q, Wang Y, Lang Y, Cao H, Yang H, Wang J, Zhang X. Paired-end sequencing of long-range DNA fragments for de novo assembly of large, complex Mammalian genomes by direct intra-molecule ligation. PLoS One 2012;7:e46211. [PMID: 23029438 PMCID: PMC3459883 DOI: 10.1371/journal.pone.0046211] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2012] [Accepted: 08/28/2012] [Indexed: 11/19/2022] Open

Tanaseichuk O, Borneman J, Jiang T. Separating metagenomic short reads into genomes via clustering. Algorithms Mol Biol 2012;7:27. [PMID: 23009059 PMCID: PMC3537596 DOI: 10.1186/1748-7188-7-27] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2012] [Accepted: 09/14/2012] [Indexed: 01/21/2023] Open

Abstract

Background

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high complexity datasets, where in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve the sequencing efficiency and cost. On the other hand, they result in shorter reads, which makes the separation of reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods that use reference databases to align reads and composition-based methods that use composition patterns (i.e., frequencies of short words or l-mers) to cluster reads. Similarity-based methods are unable to classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in significantly large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels.

Results

In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm results in clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm can handle very short reads and sequencing errors. It was initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.

Conclusions

Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
