1
|
Rakkammal K, Priya A, Pandian S, Maharajan T, Rathinapriya P, Satish L, Ceasar SA, Sohn SI, Ramesh M. Conventional and Omics Approaches for Understanding the Abiotic Stress Response in Cereal Crops-An Updated Overview. PLANTS (BASEL, SWITZERLAND) 2022; 11:plants11212852. [PMID: 36365305 PMCID: PMC9655223 DOI: 10.3390/plants11212852] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Revised: 10/19/2022] [Accepted: 10/22/2022] [Indexed: 05/22/2023]
Abstract
Cereals have evolved various tolerance mechanisms to cope with abiotic stress. Understanding the abiotic stress response mechanism of cereal crops at the molecular level offers a path to high-yielding and stress-tolerant cultivars to sustain food and nutritional security. In this regard, enormous progress has been made in the omics field in the areas of genomics, transcriptomics, and proteomics. Omics approaches generate a massive amount of data, and adequate advancements in computational tools have been achieved for effective analysis. The combination of integrated omics and bioinformatics approaches has been recognized as vital to generating insights into genome-wide stress-regulation mechanisms. In this review, we have described the self-driven drought, heat, and salt stress-responsive mechanisms that are highlighted by the integration of stress-manipulating components, including transcription factors, co-expressed genes, proteins, etc. This review also provides a comprehensive catalog of available online omics resources for cereal crops and their effective utilization. Thus, the details provided in the review will enable us to choose the appropriate tools and techniques to reduce the negative impacts and limit the failures in the intensive crop improvement study.
Collapse
Affiliation(s)
- Kasinathan Rakkammal
- Department of Biotechnology, Science Campus, Alagappa University, Karaikudi 630003, Tamil Nadu, India
| | - Arumugam Priya
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27606, USA
| | - Subramani Pandian
- Department of Agricultural Biotechnology, National Institute of Agricultural Sciences, Rural Development Administration, Jeonju 54874, Korea
| | - Theivanayagam Maharajan
- Department of Biosciences, Rajagiri College of Social Sciences, Cochin 683104, Kerala, India
| | - Periyasamy Rathinapriya
- Department of Biotechnology, Science Campus, Alagappa University, Karaikudi 630003, Tamil Nadu, India
| | - Lakkakula Satish
- Applied Phycology and Biotechnology Division, Marine Algal Research Station, Mandapam Camp, CSIR—Central Salt and Marine Chemicals Research Institute, Bhavnagar 623519, Tamil Nadu, India
| | | | - Soo-In Sohn
- Department of Agricultural Biotechnology, National Institute of Agricultural Sciences, Rural Development Administration, Jeonju 54874, Korea
| | - Manikandan Ramesh
- Department of Biotechnology, Science Campus, Alagappa University, Karaikudi 630003, Tamil Nadu, India
- Correspondence:
| |
Collapse
|
2
|
Dida F, Yi G. Empirical evaluation of methods for de novo genome assembly. PeerJ Comput Sci 2021; 7:e636. [PMID: 34307867 PMCID: PMC8279138 DOI: 10.7717/peerj-cs.636] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2021] [Accepted: 06/19/2021] [Indexed: 06/12/2023]
Abstract
Technologies for next-generation sequencing (NGS) have stimulated an exponential rise in high-throughput sequencing projects and resulted in the development of new read-assembly algorithms. A drastic reduction in the costs of generating short reads on the genomes of new organisms is attributable to recent advances in NGS technologies such as Ion Torrent, Illumina, and PacBio. Genome research has led to the creation of high-quality reference genomes for several organisms, and de novo assembly is a key initiative that has facilitated gene discovery and other studies. More powerful analytical algorithms are needed to work on the increasing amount of sequence data. We make a thorough comparison of the de novo assembly algorithms to allow new users to clearly understand the assembly algorithms: overlap-layout-consensus and de-Bruijn-graph, string-graph based assembly, and hybrid approach. We also address the computational efficacy of each algorithm's performance, challenges faced by the assem- bly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers and other related assembly differences with and without the reference genome. We hope that this analysis will contribute to further the application of de novo sequences and help the future growth of assembly algorithms.
Collapse
Affiliation(s)
- Firaol Dida
- Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
| | - Gangman Yi
- Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
| |
Collapse
|
3
|
Yao W, Li Y, Xie W, Wang L. Features of sRNA biogenesis in rice revealed by genetic dissection of sRNA expression level. Comput Struct Biotechnol J 2020; 18:3207-3216. [PMID: 33209208 PMCID: PMC7649420 DOI: 10.1016/j.csbj.2020.10.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2020] [Revised: 09/24/2020] [Accepted: 10/11/2020] [Indexed: 01/25/2023] Open
Abstract
We previously conducted a QTL analysis of small RNA (sRNA) abundance in flag leaves of an immortalized rice F2 (IMF2) population by aligning sRNA reads to the reference genome to quantify the expression levels of sRNAs. However, this approach missed about half of the sRNAs as only 50% of all sRNA reads could be uniquely aligned to the reference genome. Here, we quantified the expression levels of sRNAs and sRNA clusters without the use of a reference genome. QTL analysis of the expression levels of sRNAs and sRNA clusters confirmed the feasibility of this approach. sRNAs and sRNA clusters with identified QTLs were then aligned to the high-quality parental genomes of the IMF2 population to resolve the identified QTLs into local vs. distant regulation mode. We were able to detect new QTL hotspots by considering sRNAs aligned to multiple positions of the parental genomes and sRNAs unaligned to the parental genomes. We found that several local-QTL hotspots were caused by sequence variations in long inverted repeats, which probably function as precursors of sRNAs, between the two parental genomes. The expression levels of these sRNAs were significantly associated with the presence/absence of the long inverted repeats in the IMF2 population. Moreover, we found that the variations in whole-genome sRNA species composition among different IMF2s were attributed to sRNA biogenesis genes including OsDCL2b and OsRDR2. Our results highlight that genetic dissection of sRNA expression is a promising approach to disclose new components functioning in sRNA biogenesis and new mechanisms of sRNA biogenesis.
Collapse
Affiliation(s)
- Wen Yao
- National Key Laboratory of Wheat and Maize Crop Science, College of Life Sciences, Henan Agricultural University, Zhengzhou 450002, China.,National Key Laboratory of Crop Genetic Improvement, National Center of Plant Gene Research (Wuhan), Huazhong Agricultural University, Wuhan 430070, China
| | - Yang Li
- National Key Laboratory of Wheat and Maize Crop Science, College of Life Sciences, Henan Agricultural University, Zhengzhou 450002, China
| | - Weibo Xie
- National Key Laboratory of Crop Genetic Improvement, National Center of Plant Gene Research (Wuhan), Huazhong Agricultural University, Wuhan 430070, China
| | - Lei Wang
- National Key Laboratory of Crop Genetic Improvement, National Center of Plant Gene Research (Wuhan), Huazhong Agricultural University, Wuhan 430070, China
| |
Collapse
|
4
|
Li F, Zhao X, Li M, He K, Huang C, Zhou Y, Li Z, Walters JR. Insect genomes: progress and challenges. INSECT MOLECULAR BIOLOGY 2019; 28:739-758. [PMID: 31120160 DOI: 10.1111/imb.12599] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/02/2017] [Revised: 03/22/2019] [Accepted: 05/14/2019] [Indexed: 05/24/2023]
Abstract
In the wake of constant improvements in sequencing technologies, numerous insect genomes have been sequenced. Currently, 1219 insect genome-sequencing projects have been registered with the National Center for Biotechnology Information, including 401 that have genome assemblies and 155 with an official gene set of annotated protein-coding genes. Comparative genomics analysis showed that the expansion or contraction of gene families was associated with well-studied physiological traits such as immune system, metabolic detoxification, parasitism and polyphagy in insects. Here, we summarize the progress of insect genome sequencing, with an emphasis on how this impacts research on pest control. We begin with a brief introduction to the basic concepts of genome assembly, annotation and metrics for evaluating the quality of draft assemblies. We then provide an overview of genome information for numerous insect species, highlighting examples from prominent model organisms, agricultural pests and disease vectors. We also introduce the major insect genome databases. The increasing availability of insect genomic resources is beneficial for developing alternative pest control methods. However, many opportunities remain for developing data-mining tools that make maximal use of the available insect genome resources. Although rapid progress has been achieved, many challenges remain in the field of insect genomics.
Collapse
Affiliation(s)
- F Li
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - X Zhao
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - M Li
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - K He
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - C Huang
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - Y Zhou
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - Z Li
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
| | - J R Walters
- Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA
| |
Collapse
|
5
|
GAAP: A Genome Assembly + Annotation Pipeline. BIOMED RESEARCH INTERNATIONAL 2019; 2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 05/20/2019] [Accepted: 05/26/2019] [Indexed: 12/24/2022]
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
Collapse
|
6
|
Fu S, Chang PL, Friesen ML, Teakle NL, Tarone AM, Sze SH. Identifying similar transcripts in a related organism from de Bruijn graphs of RNA-Seq data, with applications to the study of salt and waterlogging tolerance in Melilotus. BMC Genomics 2019; 20:425. [PMID: 31167652 PMCID: PMC6551239 DOI: 10.1186/s12864-019-5702-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Background A popular strategy to study alternative splicing in non-model organisms starts from sequencing the entire transcriptome, then assembling the reads by using de novo transcriptome assembly algorithms to obtain predicted transcripts. A similarity search algorithm is then applied to a related organism to infer possible function of these predicted transcripts. While some of these predictions may be inaccurate and transcripts with low coverage are often missed, we observe that it is possible to obtain a more complete set of transcripts to facilitate possible functional assignments by starting the search from the intermediate de Bruijn graph that contains all branching possibilities. Results We develop an algorithm to extract similar transcripts in a related organism by starting the search from the de Bruijn graph that represents the transcriptome instead of from predicted transcripts. We show that our algorithm is able to recover more similar transcripts than existing algorithms, with large improvements in obtaining longer transcripts and a finer resolution of isoforms. We apply our algorithm to study salt and waterlogging tolerance in two Melilotus species by constructing new RNA-Seq libraries. Conclusions We have developed an algorithm to identify paths in the de Bruijn graph that correspond to similar transcripts in a related organism directly. Our strategy bypasses the transcript prediction step in RNA-Seq data and makes use of support from evolutionary information. Electronic supplementary material The online version of this article (10.1186/s12864-019-5702-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Shuhua Fu
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, 77843, TX, USA
| | - Peter L Chang
- Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, 90089, CA, USA
| | - Maren L Friesen
- Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, 90089, CA, USA.,Department of Crop and Soil Sciences, Washington State University, Pullman, 99164, WA, USA.,Department of Plant Pathology, Washington State University, Pullman, 99164, WA, USA
| | - Natasha L Teakle
- Centre for Ecohydrology, The University of Western Australia, 35 Stirling Highway, Crawley, 6009, WA, Australia.,School of Plant Biology (M084), Faculty of Natural and Agricultural Sciences, The University of Western Australia, 35 Stirling Highway, Crawley, 6009, WA, Australia
| | - Aaron M Tarone
- Department of Entomology, Texas A&M University, College Station, 77843, TX, USA
| | - Sing-Hoi Sze
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, 77843, TX, USA. .,Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, TX, USA.
| |
Collapse
|
7
|
Kwon D, Lee J, Kim J. GMASS: a novel measure for genome assembly structural similarity. BMC Bioinformatics 2019; 20:147. [PMID: 30885117 PMCID: PMC6423833 DOI: 10.1186/s12859-019-2710-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2018] [Accepted: 03/03/2019] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Thanks to the recent advancements in next-generation sequencing (NGS) technologies, large amount of genomic data, which are short DNA sequences known as reads, has been accumulating. Diverse assemblers have been developed to generate high quality de novo assemblies using the NGS reads, but their output is very different because of algorithmic differences. However, there are not properly structured measures to show the similarity or difference in assemblies. RESULTS We developed a new measure, called the GMASS score, for comparing two genome assemblies in terms of their structure. The GMASS score was developed based on the distribution pattern of the number and coverage of similar regions between a pair of assemblies. The new measure was able to show structural similarity between assemblies when evaluated by simulated assembly datasets. The application of the GMASS score to compare assemblies in recently published benchmark datasets showed the divergent performance of current assemblers as well as its ability to compare assemblies. CONCLUSION The GMASS score is a novel measure for representing structural similarity between two assemblies. It will contribute to the understanding of assembly output and developing de novo assemblers.
Collapse
Affiliation(s)
- Daehong Kwon
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
| | - Jongin Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
| | - Jaebum Kim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea.
| |
Collapse
|
8
|
Yoon S, Kim D, Kang K, Park WJ. TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix. BMC Genomics 2018; 19:653. [PMID: 30180798 PMCID: PMC6123912 DOI: 10.1186/s12864-018-5034-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2017] [Accepted: 08/23/2018] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND The challenges when developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers utilize a de Bruijn graph, with which complexity grows linearly with data size while suffering from errors and repeats. Although one can correct the errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. Two research directions are to improve either the graph reliability or the path search precision, and in this study, we focused on the former. RESULTS We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. In the proposed approach, we built contigs by direct read alignment within a distance margin and performed a junction search to construct splicing graphs. While doing so, a contig of length l was represented by a 4 × l matrix (called a consensus matrix), in which each element was the base count of the aligned reads so far. A representative sequence was obtained by taking the majority in each column of the consensus matrix to be used for further read alignment. Once the splicing graphs had been obtained, we used IsoLasso to find paths with a noticeable read depth. The experiments using real and simulated reads show that the method provided considerable improvement in sensitivity and moderately better performance when comparing sensitivity and precision. This was achieved by the error-aware graph construction using the consensus matrix, with which the reads having errors were made usable for the graph construction (otherwise, they might have been eventually discarded). This improved the quality of the coverage depth information used in the subsequent path search step and finally the reliability of the graph. CONCLUSIONS De novo assembly is mainly used to explore undiscovered isoforms and must be able to represent as many reads as possible in an efficient way. In this sense, TraRECo provides us with a potential alternative for improving graph reliability even though the computational burden is much higher than the single k-mer in the de Bruijn graph approach.
Collapse
Affiliation(s)
- Seokhyun Yoon
- Department of Electronics Eng., College of Engineering, Dankook University, Yongin-si, Korea
| | - Daeseung Kim
- Department of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea
| | - Keunsoo Kang
- Department of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea.
| | - Woong June Park
- Department of Molecular Biology, College of Natural Sciences, Dankook University, Cheonan-si, Korea
| |
Collapse
|
9
|
Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2018; 16:361-378. [PMID: 28453648 DOI: 10.1093/bfgp/elx006] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent short-read assembly in an efficient way. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly.
Collapse
|
10
|
Abstract
Gene splicing is the process of assembling a large number of unordered short sequence fragments to the original genome sequence as accurately as possible. Several popular splicing algorithms based on reads are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (Greedy-extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made.
Collapse
Affiliation(s)
- Xiuhua Si
- a Department of Computer Science & Technology , Heilongjiang University , Harbin , China
| | - Qian Wang
- b Shandong Aerospace Institute of Electronic Technology , Yantai , China
| | - Lei Zhang
- a Department of Computer Science & Technology , Heilongjiang University , Harbin , China
| | - Ruo Wu
- a Department of Computer Science & Technology , Heilongjiang University , Harbin , China
| | - Jiquan Ma
- a Department of Computer Science & Technology , Heilongjiang University , Harbin , China
| |
Collapse
|
11
|
Li M, Wu B, Yan X, Luo J, Pan Y, Wu FX, Wang J. PECC: Correcting contigs based on paired-end read distribution. Comput Biol Chem 2017; 69:178-184. [DOI: 10.1016/j.compbiolchem.2017.03.012] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2017] [Accepted: 03/27/2017] [Indexed: 11/26/2022]
|
12
|
Li M, Liao Z, He Y, Wang J, Luo J, Pan Y. ISEA: Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:916-925. [PMID: 27076460 DOI: 10.1109/tcbb.2016.2550433] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
The purpose of de novo assembly is to report more contiguous, complete, and less error prone contigs. Thanks to the advent of the next generation sequencing (NGS) technologies, the cost of producing high depth reads is reduced greatly. However, due to the disadvantages of NGS, de novo assembly has to face the difficulties brought by repeat regions, error rate, and low sequencing coverage in some regions. Although many de novo algorithms have been proposed to solve these problems, the de novo assembly still remains a challenge. In this article, we developed an iterative seed-extension algorithm for de novo assembly, called ISEA. To avoid the negative impact induced by error rate, ISEA utilizes reads overlap and paired-end information to correct error reads before assemblying. During extending seeds in a De Bruijn graph, ISEA uses an elaborately designed score function based on paired-end information and the distribution of insert size to solve the repeat region problem. By employing the distribution of insert size, the score function can also reduce the influence of error reads. In scaffolding, ISEA adopts a relaxed strategy to join contigs that were terminated for low coverage during the extension. The performance of ISEA was compared with six previous popular assemblers on four real datasets. The experimental results demonstrate that ISEA can effectively obtain longer and more accurate scaffolds.
Collapse
|
13
|
Sze SH, Pimsler ML, Tomberlin JK, Jones CD, Tarone AM. A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms. BMC Genomics 2017; 18:387. [PMID: 28589866 PMCID: PMC5461550 DOI: 10.1186/s12864-017-3735-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Background With increased availability of de novo assembly algorithms, it is feasible to study entire transcriptomes of non-model organisms. While algorithms are available that are specifically designed for performing transcriptome assembly from high-throughput sequencing data, they are very memory-intensive, limiting their applications to small data sets with few libraries. Results We develop a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while utilizing as many RNA-Seq libraries as possible that contain hundreds of gigabases of data. New techniques are developed so that computations can be performed on a computing cluster with moderate amount of physical memory. Conclusions Our strategy minimizes memory consumption while simultaneously obtaining comparable or improved accuracy over existing algorithms. It provides support for incremental updates of assemblies when new libraries become available.
Collapse
Affiliation(s)
- Sing-Hoi Sze
- Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, TX, USA. .,Department of Biochemistry & Biophysics, Texas A&M University, College Station, 77843, TX, USA.
| | - Meaghan L Pimsler
- Department of Entomology, Texas A&M University, College Station, 77843, TX, USA
| | - Jeffery K Tomberlin
- Department of Entomology, Texas A&M University, College Station, 77843, TX, USA
| | - Corbin D Jones
- Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Aaron M Tarone
- Department of Entomology, Texas A&M University, College Station, 77843, TX, USA
| |
Collapse
|
14
|
Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016; 34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]
|
15
|
|
16
|
The A, C, G, and T of Genome Assembly. BIOMED RESEARCH INTERNATIONAL 2016; 2016:6329217. [PMID: 27247941 PMCID: PMC4877455 DOI: 10.1155/2016/6329217] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 12/22/2015] [Indexed: 11/18/2022]
Abstract
Genome assembly in its two decades of history has produced significant research, in terms of both biotechnology and computational biology. This contribution delineates sequencing platforms and their characteristics, examines key steps involved in filtering and processing raw data, explains assembly frameworks, and discusses quality statistics for the assessment of the assembled sequence. Furthermore, the paper explores recent Ubuntu-based software environments oriented towards genome assembly as well as some avenues for future research.
Collapse
|
17
|
Tripathi R, Sharma P, Chakraborty P, Varadwaj PK. Next-generation sequencing revolution through big data analytics. FRONTIERS IN LIFE SCIENCE 2016. [DOI: 10.1080/21553769.2016.1178180] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
|
18
|
Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine. Pharmaceutics 2016; 8:E15. [PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2015] [Revised: 03/11/2016] [Accepted: 04/06/2016] [Indexed: 01/15/2023] Open
Abstract
Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, till now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune-response, tailoring drug therapy and detecting tumors. We believe the precision medicine would largely benefit from bioinformatics solutions, particularly for personal genome assembly.
Collapse
Affiliation(s)
- Wenming Xiao
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Leihong Wu
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Gokhan Yavas
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Vahan Simonyan
- Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, 10903 New Hampshire Ave, Silver Spring, MD 20993, USA.
| | - Baitang Ning
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Huixiao Hong
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| |
Collapse
|
19
|
Al-Okaily AA. HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads. BMC Genomics 2016; 17:193. [PMID: 26945881 PMCID: PMC4779561 DOI: 10.1186/s12864-016-2515-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2015] [Accepted: 02/23/2016] [Indexed: 11/15/2022] Open
Abstract
Background Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. Although high quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Recently, GAGE-B study showed that a remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage. Results In this paper, we introduce a novel hierarchical genome assembly (HGA) methodology that takes further advantage of such very high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads. Conclusions We empirically evaluated this methodology for 8 leading assemblers using 7 GAGE-B bacterial datasets consisting of 100 bp Illumina HiSeq and 250 bp Illumina MiSeq reads, with coverage ranging from 100x– ∼200x. The results show that for all evaluated datasets and using most evaluated assemblers (that were used to assemble the disjoint subsets), HGA leads to a significant improvement in the quality of the assembly based on N50 and corrected N50 metrics. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2515-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Anas A Al-Okaily
- Computer Science & Engineering Department, University of Connecticut, Storrs, 06269, CT, USA.
| |
Collapse
|
20
|
Langenkämper D, Jakobi T, Feld D, Jelonek L, Goesmann A, Nattkemper TW. Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations. Front Genet 2016; 7:5. [PMID: 26904094 PMCID: PMC4748744 DOI: 10.3389/fgene.2016.00005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2015] [Accepted: 01/17/2016] [Indexed: 12/27/2022] Open
Abstract
Within the recent years clock rates of modern processors stagnated while the demand for computing power continued to grow. This applied particularly for the fields of life sciences and bioinformatics, where new technologies keep on creating rapidly growing piles of raw data with increasing speed. The number of cores per processor increased in an attempt to compensate for slight increments of clock rates. This technological shift demands changes in software development, especially in the field of high performance computing where parallelization techniques are gaining in importance due to the pressing issue of large sized datasets generated by e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples for automatic acceleration of two use cases to show typical problems and gains of transforming a serial application to a parallel one. The paper should aid the reader in deciding for a certain techniques for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability for selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. But performance is only superior at a certain problem size due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for CPU are mature and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.
Collapse
Affiliation(s)
- Daniel Langenkämper
- Biodata Mining Group, Faculty of Technology, Bielefeld University Bielefeld, Germany
| | - Tobias Jakobi
- Sektion für Bioinformatik und Systemkardiologie, Universitätsklinikum Heidelberg Heidelberg, Germany
| | | | - Lukas Jelonek
- Bioinformatik und Systembiologie, Justus Liebig University Gießen, Germany
| | - Alexander Goesmann
- Bioinformatik und Systembiologie, Justus Liebig University Gießen, Germany
| | - Tim W Nattkemper
- Biodata Mining Group, Faculty of Technology, Bielefeld University Bielefeld, Germany
| |
Collapse
|
21
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novostand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
| | - David Ruzafa
- Departamento de Quìmica Fìsica e Instituto de Biotecnologìa, Facultad de Ciencias; Universidad de Granada; Granada Spain
| | - Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
| | - Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
| |
Collapse
|
22
|
Khiste N, Ilie L. LASER: Large genome ASsembly EvaluatoR. BMC Res Notes 2015; 8:709. [PMID: 26601933 PMCID: PMC4657217 DOI: 10.1186/s13104-015-1682-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2015] [Accepted: 11/09/2015] [Indexed: 11/10/2022] Open
Abstract
Background Genome assembly is a fundamental problem with multiple applications. Current technological limitations do not allow assembling of entire genomes and many programs have been designed to produce longer and more reliable contigs. Assessing the quality of these assemblies and comparing those produced by different tools is essential
in choosing the best ones. The QUAST program has become the current state-of-the-art in quality assessment of genome assemblies. The only drawback of QUAST is high time and memory usage for large genomes, e.g., over 4 days and 120 GB of RAM for a single human genome assembly. Results We introduce LASER, a new tool for assembly evaluation that improves greatly the speed and memory requirements of QUAST. For a human genome assembly, LASER is 5.6 times faster than QUAST while using only half the memory; one human genome assembly is evaluated in 17 hours instead of 4 days. The code of LASER is based on
that of QUAST and therefore inherits all its features. Conclusions Genome assembly evaluation is an essential step in assessing the quality of an assembly that is too often done improperly, in part due to significant resource consumption. With the introduction of LASER, proper evaluation can be performed efficiently.
Collapse
Affiliation(s)
- Nilesh Khiste
- Department of Computer Science, University of Western Ontario, London, ON, N6A 5B7, Canada.
| | - Lucian Ilie
- Department of Computer Science, University of Western Ontario, London, ON, N6A 5B7, Canada.
| |
Collapse
|
23
|
Zhang F, Liao X, Peng S, Cui Y, Wang B, Zhu X, Liu J. A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer. Interdiscip Sci 2015; 8:169-176. [PMID: 26403255 DOI: 10.1007/s12539-015-0127-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Revised: 07/23/2014] [Accepted: 09/03/2014] [Indexed: 11/26/2022]
Abstract
' The de novo assembly of DNA sequences is increasingly important for biological researches in the genomic era. After more than one decade since the Human Genome Project, some challenges still exist and new solutions are being explored to improve de novo assembly of genomes. String graph assembler (SGA), based on the string graph theory, is a new method/tool developed to address the challenges. In this paper, based on an in-depth analysis of SGA we prove that the SGA-based sequence de novo assembly is an NP-complete problem. According to our analysis, SGA outperforms other similar methods/tools in memory consumption, but costs much more time, of which 60-70 % is spent on the index construction. Upon this analysis, we introduce a hybrid parallel optimization algorithm and implement this algorithm in the TianHe-2's parallel framework. Simulations are performed with different datasets. For data of small size the optimized solution is 3.06 times faster than before, and for data of middle size it's 1.60 times. The results demonstrate an evident performance improvement, with the linear scalability for parallel FM-index construction. This results thus contribute significantly to improving the efficiency of de novo assembly of DNA sequences.
Collapse
Affiliation(s)
- Feng Zhang
- Department of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Xiangke Liao
- Department of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Shaoliang Peng
- Department of Computer Science, National University of Defense Technology, Changsha, 410073, China.
| | - Yingbo Cui
- Department of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Bingqiang Wang
- National Supercomputing Center in Shenzhen, Shenzhen, 518055, China
| | - Xiaoqian Zhu
- Department of Computer Science, National University of Defense Technology, Changsha, 410073, China.
| | - Jie Liu
- Department of Computer Science, National University of Defense Technology, Changsha, 410073, China
| |
Collapse
|
24
|
Survey of Programs Used to Detect Alternative Splicing Isoforms from Deep Sequencing Data In Silico. BIOMED RESEARCH INTERNATIONAL 2015; 2015:831352. [PMID: 26421304 PMCID: PMC4573434 DOI: 10.1155/2015/831352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/26/2014] [Revised: 02/17/2015] [Accepted: 03/02/2015] [Indexed: 11/29/2022]
Abstract
Next-generation sequencing techniques have been rapidly emerging. However, the massive sequencing reads hide a great deal of unknown important information. Advances have enabled researchers to discover alternative splicing (AS) sites and isoforms using computational approaches instead of molecular experiments. Given the importance of AS for gene expression and protein diversity in eukaryotes, detecting alternative splicing and isoforms represents a hot topic in systems biology and epigenetics research. The computational methods applied to AS prediction have improved since the emergence of next-generation sequencing. In this study, we introduce state-of-the-art research on AS and then compare the research methods and software tools available for AS based on next-generation sequencing reads. Finally, we discuss the prospects of computational methods related to AS.
Collapse
|
25
|
Abstract
The next generation sequencing (NGS) is an important process which assures inexpensive organization of vast size of raw sequence dataset over any traditional sequencing systems or methods. Various aspects of NGS such as template preparation, sequencing imaging and genome alignment and assembly outline the genome sequencing and alignment. Consequently, de Bruijn graph (dBG) is an important mathematical tool that graphically analyzes how the orientations are constructed in groups of nucleotides. Basically, dBG describes the formation of the genome segments in circular iterative fashions. Some pivotal dBG-based de novo algorithms and software packages such as T-IDBA, Oases, IDBA-tran, Euler, Velvet, ABySS, AllPaths, SOAPde novo and SOAPde novo2 are illustrated in this paper. Consequently, overlap layout consensus (OLC) graph-based algorithms also play vital role in NGS assembly. Some important OLC-based algorithms such as MIRA3, CABOG, Newbler, Edena, Mosaik and SHORTY are portrayed in this paper. It has been experimented that greedy graph-based algorithms and software packages are also vital for proper genome dataset assembly. A few algorithms named SSAKE, SHARCGS and VCAKE help to perform proper genome sequencing.
Collapse
Affiliation(s)
- Sonia Farhana Nimmy
- Department of Computer Science and Engineering, BGC Trust University, BGC Biddha Nagar, Chandanaish, Chittagong, Bangladesh
| | - M. S. Kamal
- Department of Computer Science and Engineering, BGC Trust University, BGC Biddha Nagar, Chandanaish, Chittagong, Bangladesh
| |
Collapse
|
26
|
Ghosh TS, Mehra V, Mande SS. Grid-Assembly: An oligonucleotide composition-based partitioning strategy to aid metagenomic sequence assembly. J Bioinform Comput Biol 2015; 13:1541004. [PMID: 25790784 DOI: 10.1142/s0219720015410048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
Abstract
Metagenomics approach involves extraction, sequencing and characterization of the genomic content of entire community of microbes present in a given environment. In contrast to genomic data, accurate assembly of metagenomic sequences is a challenging task. Given the huge volume and the diverse taxonomic origin of metagenomic sequences, direct application of single genome assembly methods on metagenomes are likely to not only lead to an immense increase in requirements of computational infrastructure, but also result in the formation of chimeric contigs. A strategy to address the above challenge would be to partition metagenomic sequence datasets into clusters and assemble separately the sequences in individual clusters using any single-genome assembly method. The current study presents such an approach that uses tetranucleotide usage patterns to first represent sequences as points in a three dimensional (3D) space. The 3D space is subsequently partitioned into "Grids". Sequences within overlapping grids are then progressively assembled using any available assembler. We demonstrate the applicability of the current Grid-Assembly method using various categories of assemblers as well as different simulated metagenomic datasets. Validation results indicate that the Grid-Assembly approach helps in improving the overall quality of assembly, in terms of the purity and volume of the assembled contigs.
Collapse
Affiliation(s)
- Tarini Shankar Ghosh
- Biosciences R&D Division, TCS Innovation Labs, 54-B Hadapsar Industrial Estate, Pune, Maharashtra 411013, India
| | | | | |
Collapse
|
27
|
Sim M, Kim J. Metagenome assembly through clustering of next-generation sequencing data using protein sequences. J Microbiol Methods 2015; 109:180-7. [PMID: 25572018 DOI: 10.1016/j.mimet.2015.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2014] [Revised: 01/03/2015] [Accepted: 01/03/2015] [Indexed: 11/16/2022]
Abstract
The study of environmental microbial communities, called metagenomics, has gained a lot of attention because of the recent advances in next-generation sequencing (NGS) technologies. Microbes play a critical role in changing their environments, and the mode of their effect can be solved by investigating metagenomes. However, the difficulty of metagenomes, such as the combination of multiple microbes and different species abundance, makes metagenome assembly tasks more challenging. In this paper, we developed a new metagenome assembly method by utilizing protein sequences, in addition to the NGS read sequences. Our method (i) builds read clusters by using mapping information against available protein sequences, and (ii) creates contig sequences by finding consensus sequences through probabilistic choices from the read clusters. By using simulated NGS read sequences from real microbial genome sequences, we evaluated our method in comparison with four existing assembly programs. We found that our method could generate relatively long and accurate metagenome assemblies, indicating that the idea of using protein sequences, as a guide for the assembly, is promising.
Collapse
Affiliation(s)
- Mikang Sim
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
| | - Jaebum Kim
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea.
| |
Collapse
|
28
|
Gai LP, Liu H, Cui JH, Ji N, Ding XD, Sun C, Yu LS. Distributions of allele combination in single and cross loci among patients with several kinds of chronic diseases and the normal population. Genomics 2015; 105:168-74. [PMID: 25561352 DOI: 10.1016/j.ygeno.2014.12.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2014] [Revised: 12/05/2014] [Accepted: 12/27/2014] [Indexed: 11/27/2022]
Abstract
Genetic research has progressed along with scientific and technological developments. However, it is difficult to identify frequency differences in a particular allele distribution at a single locus. Such differences can be identified by examining the allele combination distribution. We explored different mathematical methods for statistical analyses to assess the association between the genotype and phenotype. We investigated the frequency distributions of alleles, combinations of single-locus genes, and combinations of cross-loci genes at 15 loci using 447 blood samples of 200 normal subjects, 72 patients with chronic obstructive pulmonary resistance, 50 liver cancers, 75 stomach cancers and 50 hematencephalon and identified each population as having a unique gene distribution and that the distribution followed certain rules. The probability of illness followed different rules and had apparent specificity. Differences obtained using statistics of combinations of cross-loci genes are superior to single-locus gene statistics, and combinations of single-locus gene statistics are better than allelic statistics.
Collapse
Affiliation(s)
- Li-ping Gai
- Department of Physics, Dalian Medical University, Dalian 116044, China
| | - Hui Liu
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China.
| | - Jing-hui Cui
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| | - Na Ji
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| | - Xiao-dong Ding
- Department of Physics, Dalian Medical University, Dalian 116044, China
| | - Cui Sun
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| | - Lai-shui Yu
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| |
Collapse
|
29
|
Bratcher HB, Corton C, Jolley KA, Parkhill J, Maiden MCJ. A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes. BMC Genomics 2014; 15:1138. [PMID: 25523208 PMCID: PMC4377854 DOI: 10.1186/1471-2164-15-1138] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2014] [Accepted: 12/04/2014] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publically available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary. RESULTS The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses and the data are publically available in the PubMLST Neisseria database. CONCLUSIONS The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease causing lineages.
Collapse
|
30
|
Zhu X, Leung HCM, Chin FYL, Yiu SM, Quan G, Liu B, Wang Y. PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach. PLoS One 2014; 9:e114253. [PMID: 25461763 PMCID: PMC4252104 DOI: 10.1371/journal.pone.0114253] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2014] [Accepted: 11/05/2014] [Indexed: 12/31/2022] Open
Abstract
Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly which plays significant roles in many applications remains a great challenge. Most of the state-of-the-art approaches base on de Bruijn graph strategy and overlap-layout strategy. However, these approaches which depend on k-mers or read overlaps do not fully utilize information of paired-end and single-end reads when resolving branches. Since they treat all single-end reads with overlapped length larger than a fix threshold equally, they fail to use the more confident long overlapped reads for assembling and mix up with the relative short overlapped reads. Moreover, these approaches have not been special designed for handling tandem repeats (repeats occur adjacently in the genome) and they usually break down the contigs near the tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts greedy-like prediction strategy for assembling reads to contigs and scaffolds using paired-end reads and different read overlap size ranging from Omax to Omin to resolve the gaps and branches. By constructing a decision model using machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA will try to extend the contig by all feasible extensions and determine the correct extension by using look-ahead approach. Many difficult-resolved branches are due to tandem repeats which are close in the genome. PERGA detects such different copies of the repeats to resolve the branches to make the extension much longer and more accurate. We evaluated PERGA on both Illumina real and simulated datasets ranging from small bacterial genomes to large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
Collapse
Affiliation(s)
- Xiao Zhu
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | | | | | - Siu Ming Yiu
- Department of Computer Science, University of Hong Kong, Hong Kong
| | - Guangri Quan
- National Pilot School of Software, Harbin Institute of Technology, Weihai, Shandong, China
| | - Bo Liu
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yadong Wang
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- * E-mail:
| |
Collapse
|
31
|
Luo J, Wang J, Zhang Z, Wu FX, Li M, Pan Y. EPGA: de novo assembly using the distributions of reads and insert size. ACTA ACUST UNITED AC 2014; 31:825-33. [PMID: 25406329 DOI: 10.1093/bioinformatics/btu762] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
MOTIVATION In genome assembly, the primary issue is how to determine upstream and downstream sequence regions of sequence seeds for constructing long contigs or scaffolds. When extending one sequence seed, repetitive regions in the genome always cause multiple feasible extension candidates which increase the difficulty of genome assembly. The universally accepted solution is choosing one based on read overlaps and paired-end (mate-pair) reads. However, this solution faces difficulties with regard to some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions and uneven sequencing depth leads some sequence regions to have too few or too many reads. All the aforementioned problems prohibit existing assemblers from getting satisfactory assembly results. RESULTS In this article, we develop an algorithm, called extract paths for genome assembly (EPGA), which extracts paths from De Bruijn graph for genome assembly. EPGA uses a new score function to evaluate extension candidates based on the distributions of reads and insert size. The distribution of reads can solve problems caused by sequencing errors and short repetitive regions. Through assessing the variation of the distribution of insert size, EPGA can solve problems introduced by some complex repetitive regions. For solving uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. On real datasets, we compare the performance of EPGA and other popular assemblers. The experimental results demonstrate that EPGA can effectively obtain longer and more accurate contigs and scaffolds.
Collapse
Affiliation(s)
- Junwei Luo
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Jianxin Wang
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Zhen Zhang
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Fang-Xiang Wu
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Min Li
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Yi Pan
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| |
Collapse
|
32
|
Sequence assembly using next generation sequencing data--challenges and solutions. SCIENCE CHINA-LIFE SCIENCES 2014; 57:1140-8. [PMID: 25326069 DOI: 10.1007/s11427-014-4752-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/29/2014] [Accepted: 08/25/2014] [Indexed: 10/24/2022]
Abstract
Sequence assembling is an important step for bioinformatics study. With the help of next generation sequencing (NGS) technology, high throughput DNA fragment (reads) can be randomly sampled from DNA or RNA molecular sequence. However, as the positions of reads being sampled are unknown, assembling process is required for combining overlapped reads to reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing methods, although the throughput of NGS reads increases, the read length is shorter and the error rate is higher. It introduces several problems in assembling. Moreover, paired-end reads instead of single-end reads can be sampled which contain more information. The existing assemblers cannot fully utilize this information and fails to assemble longer contigs. In this article, we will revisit the major problems of assembling NGS reads on genomic, transcriptomic, metagenomic and metatranscriptomic data. We will also describe our IDBA package for solving these problems. IDBA package has adopted several novel ideas in assembling, including using multiple k, local assembling and progressive depth removal. Compared with existence assemblers, IDBA has better performance on many simulated and real sequencing datasets.
Collapse
|
33
|
Abstract
Traditionally, microbial genome sequencing has been restricted to the small number of species that can be grown in pure culture. The progressive development of culture-independent methods over the last 15 years now allows researchers to sequence microbial communities directly from environmental samples. This approach is commonly referred to as "metagenomics" or "community genomics". However, the term metagenomics is applied liberally in the literature to describe any culture-independent analysis of microbial communities. Here, we define metagenomics as shotgun ("random") sequencing of the genomic DNA of a sample taken directly from the environment. The metagenome can be thought of as a sampling of the collective genome of the microbial community. We outline the considerations and analyses that should be undertaken to ensure the success of a metagenomic sequencing project, including the choice of sequencing platform and methods for assembly, binning, annotation, and comparative analysis.
Collapse
Affiliation(s)
- Lauren Bragg
- Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD, Australia
| | | |
Collapse
|
34
|
Bao E, Jiang T, Girke T. AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics 2014; 30:i319-i328. [PMID: 24932000 PMCID: PMC4058956 DOI: 10.1093/bioinformatics/btu291] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
Motivation:De novo assemblies of genomes remain one of the most challenging applications in next-generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species. Results: Here we introduce AlignGraph, an algorithm for extending and joining de novo-assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and preassembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the PE multipositional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7–62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9–94.5% and 80.3–165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph’s efficiency in improving genome assemblies by taking advantage of closely related references. Availability and implementation: The AlignGraph software can be downloaded for free from this site: https://github.com/baoe/AlignGraph. Contact:thomas.girke@ucr.edu
Collapse
Affiliation(s)
- Ergude Bao
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
| | - Tao Jiang
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
| | - Thomas Girke
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
| |
Collapse
|
35
|
Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics 2014; 15:302. [PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2014] [Accepted: 08/01/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics.
Collapse
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| | - Bahlul Haider
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| | - Michael Molnar
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| | - Roberto Solis-Oba
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| |
Collapse
|
36
|
Sze SH, Tarone AM. A memory-efficient algorithm to obtain splicing graphs and de novo expression estimates from de Bruijn graphs of RNA-Seq data. BMC Genomics 2014; 15 Suppl 5:S6. [PMID: 25082000 PMCID: PMC4120145 DOI: 10.1186/1471-2164-15-s5-s6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background The recent advance of high-throughput sequencing makes it feasible to study entire transcriptomes through the application of de novo sequence assembly algorithms. While a popular strategy is to first construct an intermediate de Bruijn graph structure to represent the transcriptome, an additional step is needed to construct predicted transcripts from the graph. Results Since the de Bruijn graph contains all branching possibilities, we develop a memory-efficient algorithm to recover alternative splicing information and library-specific expression information directly from the graph without prior genomic knowledge. We implement the algorithm as a postprocessing module of the Velvet assembler. We validate our algorithm by simulating the transcriptome assembly of Drosophila using its known genome, and by performing Drosophila transcriptome assembly using publicly available RNA-Seq libraries. Under a range of conditions, our algorithm recovers sequences and alternative splicing junctions with higher specificity than Oases or Trans-ABySS. Conclusions Since our postprocessing algorithm does not consume as much memory as Velvet and is less memory-intensive than Oases, it allows biologists to assemble large libraries with limited computational resources. Our algorithm has been applied to perform transcriptome assembly of the non-model blow fly Lucilia sericata that was reported in a previous article, which shows that the assembly is of high quality and it facilitates comparison of the Lucilia sericata transcriptome to Drosophila and two mosquitoes, prediction and experimental validation of alternative splicing, investigation of differential expression among various developmental stages, and identification of transposable elements.
Collapse
|
37
|
Genomics and Proteomics of Foodborne Microorganisms. Food Microbiol 2014. [DOI: 10.1128/9781555818463.ch39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
38
|
Reddy RM, Mohammed MH, Mande SS. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets. Genomics 2014; 103:161-8. [DOI: 10.1016/j.ygeno.2014.02.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2013] [Revised: 02/18/2014] [Accepted: 02/24/2014] [Indexed: 10/25/2022]
|
39
|
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
|
40
|
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. Here we discuss them as a framework of four stages for data analysis and processing and survey variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that face current assemblers in the next-generation environment to determine the current state-of-the-art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Collapse
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
| |
Collapse
|
41
|
Ramos RTJ, Carneiro AR, Caracciolo PH, Azevedo V, Schneider MPC, Barh D, Silva A. Graphical contig analyzer for all sequencing platforms (G4ALL): a new stand-alone tool for finishing and draft generation of bacterial genomes. Bioinformation 2013; 9:599-604. [PMID: 23888102 PMCID: PMC3717189 DOI: 10.6026/97320630009599] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2013] [Accepted: 05/27/2013] [Indexed: 11/23/2022] Open
Abstract
UNLABELLED Genome assembly has always been complicated due to the inherent difficulties of sequencing technologies, as well the computational methods used to process sequences. Although many of the problems for the generation of contigs from reads are well known, especially those involving short reads, the orientation and ordination of contigs in the finishing stages is still very challenging and time consuming, as it requires the manual curation of the contigs to guarantee correct identification them and prevent misassembly. Due to the large numbers of sequences that are produced, especially from the reads produced by next generation sequencers, this process demands considerable manual effort, and there are few software options available to facilitate the process. To address this problem, we have developed the Graphic Contig Analyzer for All Sequencing Platforms (G4ALL): a stand-alone multi-user tool that facilitates the editing of the contigs produced in the assembly process. Besides providing information on the gene products contained in each contig, obtained through a search of the available biological databases, G4ALL produces a scaffold of the genome, based on the overlap of the contigs after curation. AVAILABILITY THE SOFTWARE IS AVAILABLE AT: http://www.genoma.ufpa.br/rramos/softwares/g4all.xhtml.
Collapse
|
42
|
Chu HT, Hsiao WWL, Tsao TTH, Hsu DF, Chen CC, Lee SA, Kao CY. SeqEntropy: genome-wide assessment of repeats for short read sequencing. PLoS One 2013; 8:e59484. [PMID: 23544073 PMCID: PMC3609794 DOI: 10.1371/journal.pone.0059484] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2012] [Accepted: 02/14/2013] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Recent studies on genome assembly from short-read sequencing data reported the limitation of this technology to reconstruct the entire genome even at very high depth coverage. We investigated the limitation from the perspective of information theory to evaluate the effect of repeats on short-read genome assembly using idealized (error-free) reads at different lengths. METHODOLOGY/PRINCIPAL FINDINGS We define a metric H(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH(k) to measure the impact of repeats for the reconstruction of whole-genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de-novo assembly coverage of a genome, and a score of ΔH(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths to achieve ΔH(k)<1% are different for various organisms and are independent of the genome size. For example, in order to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for the sequencing of human genome (3.2 10(9) bp) and 320 bp for the sequencing of fruit fly (1.8×10(8) bp). We also calculated the ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and the entropy of sequencing reads provides a measurement for the repeat structures. CONCLUSIONS/SIGNIFICANCE The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation on the limitation of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large euakryotic genomes. This approach may be useful to tune the sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
Collapse
Affiliation(s)
- Hsueh-Ting Chu
- Department of Biomedical informatics, Asia University, Taichung, Taiwan
- Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan
| | - William WL. Hsiao
- British Columbia Public Health Microbiology and Reference Laboratory, Vancouver, British Columbia, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Theresa TH. Tsao
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
| | - D. Frank Hsu
- Department of Computer and Information Science, Fordham University, New York, New York, United States of America
| | - Chaur-Chin Chen
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Sheng-An Lee
- Department of Information Management, Kainan University, Taoyuan, Taiwan
| | - Cheng-Yan Kao
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
| |
Collapse
|
43
|
Fu N, Wang Q, Shen HL. De novo assembly, gene annotation and marker development using Illumina paired-end transcriptome sequences in celery (Apium graveolens L.). PLoS One 2013; 8:e57686. [PMID: 23469050 PMCID: PMC3585167 DOI: 10.1371/journal.pone.0057686] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2012] [Accepted: 01/23/2013] [Indexed: 12/11/2022] Open
Abstract
Background Celery is an increasing popular vegetable species, but limited transcriptome and genomic data hinder the research to it. In addition, a lack of celery molecular markers limits the process of molecular genetic breeding. High-throughput transcriptome sequencing is an efficient method to generate a large transcriptome sequence dataset for gene discovery, molecular marker development and marker-assisted selection breeding. Principal Findings Celery transcriptomes from four tissues were sequenced using Illumina paired-end sequencing technology. De novo assembling was performed to generate a collection of 42,280 unigenes (average length of 502.6 bp) that represent the first transcriptome of the species. 78.43% and 48.93% of the unigenes had significant similarity with proteins in the National Center for Biotechnology Information (NCBI) non-redundant protein database (Nr) and Swiss-Prot database respectively, and 10,473 (24.77%) unigenes were assigned to Clusters of Orthologous Groups (COG). 21,126 (49.97%) unigenes harboring Interpro domains were annotated, in which 15,409 (36.45%) were assigned to Gene Ontology(GO) categories. Additionally, 7,478 unigenes were mapped onto 228 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). Large numbers of simple sequence repeats (SSRs) were indentified, and then the rate of successful amplication and polymorphism were investigated among 31 celery accessions. Conclusions This study demonstrates the feasibility of generating a large scale of sequence information by Illumina paired-end sequencing and efficient assembling. Our results provide a valuable resource for celery research. The developed molecular markers are the foundation of further genetic linkage analysis and gene localization, and they will be essential to accelerate the process of breeding.
Collapse
Affiliation(s)
- Nan Fu
- College of Agriculture and Biotechnology, China Agricultural University, Beijing, China
| | - Qian Wang
- College of Agriculture and Biotechnology, China Agricultural University, Beijing, China
| | - Huo-Lin Shen
- College of Agriculture and Biotechnology, China Agricultural University, Beijing, China
- * E-mail:
| |
Collapse
|
44
|
Rahman A, Pachter L. CGAL: computing genome assembly likelihoods. Genome Biol 2013; 14:R8. [PMID: 23360652 PMCID: PMC3663106 DOI: 10.1186/gb-2013-14-1-r8] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2012] [Accepted: 01/29/2013] [Indexed: 01/12/2023] Open
Abstract
Assembly algorithms have been extensively benchmarked using simulated data so that results can be compared to ground truth. However, in de novo assembly, only crude metrics such as contig number and size are typically used to evaluate assembly quality. We present CGAL, a novel likelihood-based approach to assembly assessment in the absence of a ground truth. We show that likelihood is more accurate than other metrics currently used for evaluating assemblies, and describe its application to the optimization and comparison of assembly algorithms. Our methods are implemented in software that is freely available at http://bio.math.berkeley.edu/cgal/.
Collapse
|
45
|
Luo C, Rodriguez-R LM, Konstantinidis KT. A user's guide to quantitative and comparative analysis of metagenomic datasets. Methods Enzymol 2013; 531:525-47. [PMID: 24060135 DOI: 10.1016/b978-0-12-407863-5.00023-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Metagenomics has revolutionized microbiological studies during the past decade and provided new insights into the diversity, dynamics, and metabolic potential of natural microbial communities. However, metagenomics still represents a field in development, and standardized tools and approaches to handle and compare metagenomes have not been established yet. An important reason accounting for the latter is the continuous changes in the type of sequencing data available, for example, long versus short sequencing reads. Here, we provide a guide to bioinformatic pipelines developed to accomplish the following tasks, focusing primarily on those developed by our team: (i) assemble a metagenomic dataset; (ii) determine the level of sequence coverage obtained and the amount of sequencing required to obtain complete coverage; (iii) identify the taxonomic affiliation of a metagenomic read or assembled contig; and (iv) determine differentially abundant genes, pathways, and species between different datasets. Most of these pipelines do not depend on the type of sequences available or can be easily adjusted to fit different types of sequences, and are freely available (for instance, through our lab Web site: http://www.enve-omics.gatech.edu/). The limitations of current approaches, as well as the computational aspects that can be further improved, will also be briefly discussed. The work presented here provides practical guidelines on how to perform metagenomic analysis of microbial communities characterized by varied levels of diversity and establishes approaches to handle the resulting data, independent of the sequencing platform employed.
Collapse
Affiliation(s)
- Chengwei Luo
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, Georgia, USA; School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA; School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| | | | | |
Collapse
|
46
|
Lu M, Luo Q, Wang B, Wu J, Zhao J. GPU-Accelerated Bidirected De Bruijn Graph Construction for Genome Assembly. WEB TECHNOLOGIES AND APPLICATIONS 2013. [DOI: 10.1007/978-3-642-37401-2_8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]
|
47
|
Wang Y, Yu Y, Pan B, Hao P, Li Y, Shao Z, Xu X, Li X. Optimizing hybrid assembly of next-generation sequence data from Enterococcus faecium: a microbe with highly divergent genome. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 3:S21. [PMID: 23282199 PMCID: PMC3524012 DOI: 10.1186/1752-0509-6-s3-s21] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Background Sequencing of bacterial genomes became an essential approach to study pathogen virulence and the phylogenetic relationship among close related strains. Bacterium Enterococcus faecium emerged as an important nosocomial pathogen that were often associated with resistance to common antibiotics in hospitals. With highly divergent gene contents, it presented a challenge to the next generation sequencing (NGS) technologies featuring high-throughput and shorter read-length. This study was designed to investigate the properties and systematic biases of NGS technologies and evaluate critical parameters influencing the outcomes of hybrid assemblies using combinations of NGS data. Results A hospital strain of E. faecium was sequenced using three different NGS platforms: 454 GS-FLX, Illumina GAIIx, and ABI SOLiD4.0, to approximately 28-, 500-, and 400-fold coverage depth. We built a pipeline that merged contigs from each NGS data into hybrid assemblies. The results revealed that each single NGS assembly had a ceiling in continuity that could not be overcome by simply increasing data coverage depth. Each NGS technology displayed some intrinsic properties, i.e. base calling error, systematic bias, etc. The gaps and low coverage regions of each NGS assembly were associated with lower GC contents. In order to optimize the hybrid assembly approach, we tested with varying amount and different combination of NGS data, and obtained optimal conditions for assembly continuity. We also, for the first time, showed that SOLiD data could help make much improved assemblies of E. faecium genome using the hybrid approach when combined with other type of NGS data. Conclusions The current study addressed the difficult issue of how to most effectively construct a complete microbial genome using today's state of the art sequencing technologies. We characterized the sequence data and genome assembly from each NGS technologies, tested conditions for hybrid assembly with combinations of NGS data, and obtained optimized parameters for achieving most cost-efficiency assembly. Our study helped form some guidelines to direct genomic work on other microorganisms, thus have important practical implications.
Collapse
Affiliation(s)
- Yajun Wang
- Shanghai Center for Systems Biomedicine, Shanghai Jiaotong University, Shanghai 200240, China
| | | | | | | | | | | | | | | |
Collapse
|
48
|
|
49
|
Geng C, Chen Y, Wu K, Cai Q, Wang Y, Lang Y, Cao H, Yang H, Wang J, Zhang X. Paired-end sequencing of long-range DNA fragments for de novo assembly of large, complex Mammalian genomes by direct intra-molecule ligation. PLoS One 2012; 7:e46211. [PMID: 23029438 PMCID: PMC3459883 DOI: 10.1371/journal.pone.0046211] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2012] [Accepted: 08/28/2012] [Indexed: 11/19/2022] Open
Abstract
Background The relatively short read lengths from next generation sequencing (NGS) technologies still pose a challenge for de novo assembly of complex mammal genomes. One important solution is to use paired-end (PE) sequence information experimentally obtained from long-range DNA fragments (>1 kb). Here, we characterize and extend a long-range PE library construction method based on direct intra-molecule ligation (or molecular linker-free circularization) for NGS. Results We found that the method performs stably for PE sequencing of 2- to 5- kb DNA fragments, and can be extended to 10–20 kb (and even in extremes, up to ∼35 kb). We also characterized the impact of low quality input DNA on the method, and develop a whole-genome amplification (WGA) based protocol using limited input DNA (<1 µg). Using this PE dataset, we accurately assembled the YanHuang (YH) genome, the first sequenced Asian genome, into a scaffold N50 size of >2 Mb, which is over100-times greater than the initial size produced with only small insert PE reads(17 kb). In addition, we mapped two 7- to 8- kb insertions in the YH genome using the larger insert sizes of the long-range PE data. Conclusions In conclusion, we demonstrate here the effectiveness of this long-range PE sequencing method and its use for the de novo assembly of a large, complex genome using NGS short reads.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | - Jian Wang
- BGI-Shenzhen, Shenzhen, Guangdong, China
- * E-mail: (JW); (XZ)
| | - Xiuqing Zhang
- BGI-Shenzhen, Shenzhen, Guangdong, China
- * E-mail: (JW); (XZ)
| |
Collapse
|
50
|
Tanaseichuk O, Borneman J, Jiang T. Separating metagenomic short reads into genomes via clustering. Algorithms Mol Biol 2012; 7:27. [PMID: 23009059 PMCID: PMC3537596 DOI: 10.1186/1748-7188-7-27] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2012] [Accepted: 09/14/2012] [Indexed: 01/21/2023] Open
Abstract
Background The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high complexity datasets, where in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve the sequencing efficiency and cost. On the other hand, they result in shorter reads, which makes the separation of reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods that use reference databases to align reads and composition-based methods that use composition patterns (i.e., frequencies of short words or l-mers) to cluster reads. Similarity-based methods are unable to classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in significantly large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels. Results In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm results in clusters of l-mers each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm could handle very short reads and sequencing errors. It is initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be download for free at
http://www.cs.ucr.edu/∼tanaseio/toss.htm. Conclusions Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with a high precision and sensitivity.
Collapse
|