101
|
Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine. Pharmaceutics 2016; 8:E15. [PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 12/17/2015] [Revised: 03/11/2016] [Accepted: 04/06/2016] [Indexed: 01/15/2023]
Abstract
Although we each share more than 99% of our genomic DNA sequence, millions of sequence or structural differences in small regions distinguish individuals and give us different appearances and different responses to medical treatment. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived from the DNA of multiple individuals. As a result, the reference genome is incomplete and may misrepresent the sequence variants of the general population. A more reliable solution is to compare the sequences of diseased tissue with the individual's own genome sequence, derived from tissue in a normal state. As the price of sequencing a human genome has dropped dramatically to around $1000, documenting a personal genome for every individual has a promising future. However, de novo assembly of individual genomes at an affordable cost is still challenging, and to date only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructure. We recommend a hybrid assembly combining long and short reads as a promising way to generate good-quality human genome assemblies, and we specify parameters for the quality assessment of assembly outcomes. We provide a perspective on the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome.
Finally, we discuss the use of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy, and detecting tumors. We believe that precision medicine would benefit greatly from bioinformatics solutions, particularly for personal genome assembly.
Affiliation(s)
- Wenming Xiao
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Leihong Wu
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Gokhan Yavas
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Vahan Simonyan
- Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, 10903 New Hampshire Ave, Silver Spring, MD 20993, USA.
| | - Baitang Ning
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Huixiao Hong
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| |
|
102
|
Huang KW, Chen JL, Yang CS, Tsai CW. A memetic gravitation search algorithm for solving DNA fragment assembly problems. Journal of Intelligent & Fuzzy Systems 2016. [DOI: 10.3233/ifs-151994] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/15/2022]
Affiliation(s)
- Ko-Wei Huang
- Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Department of Psychology, National Cheng Kung University, Tainan, Taiwan, R.O.C
| | - Jui-Le Chen
- Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Department of Computer Science and Entertainment Technology, Tajen University, Pingtung, Taiwan, R.O.C
| | - Chu-Sing Yang
- Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
| | - Chun-Wei Tsai
- Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan, R.O.C
| |
|
103
|
Moreton J, Izquierdo A, Emes RD. Assembly, Assessment, and Availability of De novo Generated Eukaryotic Transcriptomes. Front Genet 2016; 6:361. [PMID: 26793234 PMCID: PMC4707302 DOI: 10.3389/fgene.2015.00361] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Received: 09/20/2015] [Accepted: 12/19/2015] [Indexed: 11/13/2022]
Abstract
De novo assembly of a complete transcriptome without the need for a guiding reference genome is attractive, particularly where the cost and complexity of generating a eukaryote genome is prohibitive. The transcriptome should not, however, be seen as just a quick and cheap alternative to building a complete genome. Transcriptomics allows the understanding and comparison of spatial and temporal samples within an organism, and allows surveying of multiple individuals or closely related species. De novo assembly in theory allows the building of a complete transcriptome without any prior knowledge of the genome. It also allows the discovery of alternative splice forms of coding RNAs and of non-coding RNAs, which are often missed by proteomic approaches or are incompletely annotated in genome studies. The limitation of the method is that generating a truly complete assembly is unlikely, so methods are required for assessing the quality and appropriateness of a generated transcriptome. Whilst no single consensus pipeline or tool is agreed to be optimal, various algorithms and easy-to-use software exist, making transcriptome generation a more common approach. With this expansion of data, questions remain about how to make these datasets fully discoverable, comparable, and most useful for understanding complex biological systems.
Affiliation(s)
- Joanna Moreton
- Advanced Data Analysis Centre, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
- School of Veterinary Medicine and Science, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
| | - Abril Izquierdo
- School of Veterinary Medicine and Science, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
| | - Richard D. Emes
- Advanced Data Analysis Centre, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
- School of Veterinary Medicine and Science, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
| |
|
104
|
Agrawal S, Ganley ARD. Complete Sequence Construction of the Highly Repetitive Ribosomal RNA Gene Repeats in Eukaryotes Using Whole Genome Sequence Data. Methods Mol Biol 2016; 1455:161-181. [PMID: 27576718 DOI: 10.1007/978-1-4939-3792-9_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 06/06/2023]
Abstract
The ribosomal RNA genes (rDNA) encode the major rRNA species of the ribosome, and thus are essential across life. These genes are highly repetitive in most eukaryotes, forming blocks of tandem repeats that form the core of nucleoli. The primary role of the rDNA in encoding rRNA has been long understood, but more recently the rDNA has been implicated in a number of other important biological phenomena, including genome stability, cell cycle, and epigenetic silencing. Noncoding elements, primarily located in the intergenic spacer region, appear to mediate many of these phenomena. Although sequence information is available for the genomes of many organisms, in almost all cases rDNA repeat sequences are lacking, primarily due to problems in assembling these intriguing regions during whole genome assemblies. Here, we present a method to obtain complete rDNA repeat unit sequences from whole genome assemblies. Limitations of next generation sequencing (NGS) data make them unsuitable for assembling complete rDNA unit sequences; therefore, the method we present relies on the use of Sanger whole genome sequence data. Our method makes use of the Arachne assembler, which can assemble highly repetitive regions such as the rDNA in a memory-efficient way. We provide a detailed step-by-step protocol for generating rDNA sequences from whole genome Sanger sequence data using Arachne, for refining complete rDNA unit sequences, and for validating the sequences obtained. In principle, our method will work for any species where the rDNA is organized into tandem repeats. This will help researchers working on species without a complete rDNA sequence, those working on evolutionary aspects of the rDNA, and those interested in conducting phylogenetic footprinting studies with the rDNA.
Affiliation(s)
- Saumya Agrawal
- Institute of Natural and Mathematical Sciences, Massey University, Private Bag 102-904, Auckland, 0632, New Zealand.
- School of Biological Sciences, University of Auckland, Auckland, New Zealand.
| | - Austen R D Ganley
- Institute of Natural and Mathematical Sciences, Massey University, Private Bag 102-904, Auckland, 0632, New Zealand.
- School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, 1142, New Zealand.
| |
|
105
|
Abstract
Recent improvements in next-generation sequencing technology have made whole genome sequencing possible even for non-model eukaryote species with no available reference genome. However, de novo assembly of diploid genomes is still a big challenge because of allelic variation. The aim of this study was to determine the feasibility of utilizing the genome of haploid fish larvae for de novo assembly of whole-genome sequences. We compared the efficiency of assembly using the haploid genome of yellowtail (Seriola quinqueradiata) with that using the diploid genome obtained from the dam. De novo assembly from the haploid and the diploid sequence reads (100 million reads per dataset) generated by the Ion Proton sequencer (200 bp) was done with two different assembly algorithms, namely overlap-layout-consensus (OLC) and de Bruijn graph (DBG). Compared with the diploid genome assembly, the haploid genome assembly significantly reduced the total number of contigs (by approximately 22% for OLC and 9% for DBG) and gave longer average and N50 contig lengths. The haploid assembly also improved the quality of the scaffolds by reducing the number of regions with unassigned nucleotides (Ns) in OLC-based assemblies (total length of Ns: 45,331,916 bp for haploids and 67,724,360 bp for diploids). The haploid genome assembly appears to be better because allelic variation in the diploid genome disrupts the extension of contigs during the assembly process. Our results indicate that utilizing the genome of haploid larvae leads to a significant improvement in the de novo assembly process, thus providing a novel strategy for the construction of reference genomes from non-model diploid organisms such as fish.
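The contig count, average contig length, and N50 named in this abstract are standard assembly summary statistics. As a minimal sketch of how they are computed, the Python below uses two invented lists of contig lengths; the numbers and the helper names `n50` and `summarize` are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def summarize(lengths):
    """Collect the three statistics mentioned in the abstract."""
    return {
        "contigs": len(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths (arbitrary units) with the same total size.
fragmented = [2, 2, 2, 3, 3, 4, 4, 4, 8]   # many short contigs
contiguous = [4, 6, 10, 12]                # fewer, longer contigs

print(summarize(fragmented))
print(summarize(contiguous))
```

Both toy assemblies cover the same number of bases, so the comparison isolates contiguity: the assembly with fewer contigs has the larger mean length and the larger N50.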
|
106
|
Möbius P, Hölzer M, Felder M, Nordsiek G, Groth M, Köhler H, Reichwald K, Platzer M, Marz M. Comprehensive insights in the Mycobacterium avium subsp. paratuberculosis genome using new WGS data of sheep strain JIII-386 from Germany. Genome Biol Evol 2015; 7:2585-2601. [PMID: 26384038 PMCID: PMC4607514 DOI: 10.1093/gbe/evv154] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 02/04/2023]
Abstract
Mycobacterium avium (M. a.) subsp. paratuberculosis (MAP), the etiologic agent of Johne’s disease, affects cattle, sheep, and other ruminants worldwide. To decipher phenotypic differences between sheep and cattle strains (MAP-S [Type-I/III] and MAP-C [Type-II], respectively), comparative genome analysis needs data from diverse isolates originating from different geographic regions of the world. This study presents the best-assembled genome of a MAP-S strain to date: sheep isolate JIII-386 from Germany. One newly sequenced cattle isolate (JII-1961, Germany), four published MAP strains of MAP-C and MAP-S from the United States and Australia, and M. a. subsp. hominissuis (MAH) strain 104 were used for assembly improvement and comparisons. All genomes were annotated with BacProt and the results compared with the NCBI (National Center for Biotechnology Information) annotation. Corresponding protein-coding sequences (CDSs) were detected, as were CDSs determined exclusively by either NCBI or BacProt. A new Shine–Dalgarno sequence motif (5′-AGCTGG-3′) was extracted. Novel CDSs including PE-PGRS family protein genes and about 80 noncoding RNAs exhibiting high sequence conservation are presented. Previously found genetic differences between MAP types are partially revised. Four of ten assumed MAP-S-specific large sequence polymorphism regions (LSPSs) are still present in MAP-C strains; new LSPSs were identified. Independently of the regional origin of the strains, the number of individual CDSs and single nucleotide variants confirms the strong similarity of MAP-C strains and shows higher diversity among MAP-S strains. This study gives ambiguous results regarding the hypothesis that MAP-S is the evolutionary intermediate between MAH and MAP-C, but it clearly shows a higher similarity of MAP to MAH than to Mycobacterium intracellulare.
Affiliation(s)
- Petra Möbius
- NRL for Paratuberculosis, Institute of Molecular Pathogenesis, Friedrich-Loeffler-Institut (Federal Research Institute for Animal Health), Naumburger Straße 96a, 07743 Jena, Germany
| | - Martin Hölzer
- RNA Bioinformatics and High Throughput Analysis, Faculty of Mathematics and Computer Science, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Marius Felder
- Leibniz Institute for Age Research - Fritz-Lipmann-Institute (FLI), Beutenbergstraße 11, 07745 Jena, Germany
| | - Gabriele Nordsiek
- Department of Genome Analysis, Helmholtz Centre for Infection Research, Inhoffenstr. 7, 38124 Braunschweig, Germany
| | - Marco Groth
- Leibniz Institute for Age Research - Fritz-Lipmann-Institute (FLI), Beutenbergstraße 11, 07745 Jena, Germany
| | - Heike Köhler
- NRL for Paratuberculosis, Institute of Molecular Pathogenesis, Friedrich-Loeffler-Institut (Federal Research Institute for Animal Health), Naumburger Straße 96a, 07743 Jena, Germany
| | - Kathrin Reichwald
- Leibniz Institute for Age Research - Fritz-Lipmann-Institute (FLI), Beutenbergstraße 11, 07745 Jena, Germany
| | - Matthias Platzer
- Leibniz Institute for Age Research - Fritz-Lipmann-Institute (FLI), Beutenbergstraße 11, 07745 Jena, Germany
| | - Manja Marz
- RNA Bioinformatics and High Throughput Analysis, Faculty of Mathematics and Computer Science, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| |
|
107
|
García-López R, Vázquez-Castellanos JF, Moya A. Fragmentation and Coverage Variation in Viral Metagenome Assemblies, and Their Effect in Diversity Calculations. Front Bioeng Biotechnol 2015; 3:141. [PMID: 26442255 PMCID: PMC4585024 DOI: 10.3389/fbioe.2015.00141] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 05/29/2015] [Accepted: 09/03/2015] [Indexed: 01/01/2023]
Abstract
Metagenomic libraries consist of DNA fragments from diverse species, with varying genome size and abundance. High-throughput sequencing platforms produce large volumes of reads from these libraries, which may be assembled into contigs, ideally resembling the original larger genomic sequences. The uneven species distribution, along with the stochasticity in sample processing and sequencing bias, impacts the success of accurate sequence assembly. Several assemblers enable the processing of viral metagenomic data de novo, generally using overlap layout consensus or de Bruijn graph approaches for contig assembly. The success of viral genomic reconstruction in these datasets is limited by the degree of fragmentation of each genome in the sample, which is dependent on the sequencing effort and the genome length. Depending on ecological, biological, or procedural biases, some fragments have a higher prevalence, or coverage, in the assembly. However, assemblers must face challenges, such as the formation of chimerical structures and intra-species variability. Diversity calculation relies on the classification of the sequences that comprise a metagenomic dataset. Whenever the corresponding genomic and taxonomic information is available, contigs matching the same species can be classified accordingly and the coverage of its genome can be calculated for that species. This may be used to compare populations by estimating abundance and assessing species distribution from this data. Nevertheless, the coverage does not take into account the degree of fragmentation, or else genome completeness, and is not necessarily representative of actual species distribution in the samples. Furthermore, undetermined sequences are abundant in viral metagenomic datasets, resulting in several independent contigs that cannot be assigned by homology or genomic information. These may only be classified as different operational taxonomic units (OTUs), sometimes remaining inadvisably unrelated. 
Thus, calculations using contigs as different OTUs ultimately overestimate diversity when compared to diversity calculated from species coverage. In order to compare the effect of coverage and fragmentation, we generated three sets of simulated Illumina paired-end reads with different sequencing depths. We compared different assemblies performed with RayMeta, CLC Assembly Cell, MEGAHIT, SPAdes, Meta-IDBA, SOAPdenovo, Velvet, Metavelvet, and MIRA with the best attainable assemblies for each dataset (formed by arranging data using known genome coordinates) by calculating different assembly statistics. A new fragmentation score was included to estimate the degree of genome fragmentation of each taxon and adjust the coverage accordingly. The abundance in the metagenome was compared by bootstrapping the assembly data and hierarchically clustering them with the best possible assembly. Additionally, richness and diversity indexes were calculated for all the resulting assemblies and were assessed under two distributions: contigs as independent OTUs and sequences classified by species. Finally, we searched for the strongest correlations between the diversity indexes and the different assembly statistics. Although fragmentation was dependent on genome coverage, it was not as heavily influenced by the assembler. The sequencing depth was the predominant factor that influenced the success of the assemblies. The coverage increased notably in larger datasets, whereas fragmentation values remained lower and unsaturated. While still far from obtaining the ideal assemblies, the RayMeta, SPAdes, and CLC assemblers managed to build the most accurate contigs with larger datasets, while Meta-IDBA showed a good performance with the medium-sized dataset, even after the adjusted coverage was calculated. Their resulting assemblies showed the highest coverage scores and the lowest fragmentation values. 
Alpha diversity calculated from contigs as OTUs resulted in significantly higher values for all assemblies when compared with actual species distribution, showing an overestimation due to the increased predicted abundance. Conversely, using PHACCS resulted in lower values for all assemblers. Different association methods (random-forest, generalized linear models, and the Spearman correlation index) support the number of contigs, the coverage, and fragmentation as the assembly parameters that most affect the estimation of the alpha diversity. Coverage calculations may provide an insight into the relative completeness of a genome, but they overlook missing fragments or overly separated sequences in a genome. The assembly of a highly fragmented genome with high coverage may still lead to the clustering of different OTUs that are actually different fragments of a genome. Thus, it proves useful to penalize coverage with a fragmentation score. Using contigs for calculating alpha diversity results in overestimation, but it is usually the only approach available. Still, it is enough for sample comparison. The best approach may be determined by choosing the assembler that best fits the sequencing depth and adjusting the parameters for longer accurate contigs whenever possible, whereas diversity may be calculated considering taxonomical and genomic information if available.
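The overestimation described in this abstract, where each contig is counted as its own OTU, can be illustrated with a small worked example. The sketch below uses the Shannon index as the alpha-diversity measure, which is an assumption made for illustration (the abstract does not name the indexes used). The contig names, species labels, and read counts are invented, and the paper's fragmentation score is not reproduced here because the abstract does not define it.

```python
import math
from collections import Counter


def shannon(counts):
    """Shannon diversity index H = -sum(p * ln p) over non-zero counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical assembly: contig -> (species it belongs to, reads mapped to it).
contigs = {
    "contig_1": ("virus_A", 40),
    "contig_2": ("virus_A", 30),
    "contig_3": ("virus_A", 30),
    "contig_4": ("virus_B", 60),
    "contig_5": ("virus_B", 40),
}

# Distribution 1: every contig is treated as an independent OTU.
per_contig = [reads for _, reads in contigs.values()]

# Distribution 2: contigs are pooled by the species they were classified to.
per_species = Counter()
for species, reads in contigs.values():
    per_species[species] += reads

h_contigs = shannon(per_contig)
h_species = shannon(per_species.values())

print(f"H with contigs as OTUs: {h_contigs:.3f}")
print(f"H with species as OTUs: {h_species:.3f}")
```

Splitting one species into several contigs can only raise the Shannon index, so the per-contig value is an upper bound on the per-species value. With two equally abundant species the per-species value is ln 2, and the fragmented view reports more than twice that.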
Affiliation(s)
- Rodrigo García-López
- Área de Genómica y Salud, Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunidad Valenciana (FISABIO)-Salud Pública, Valencia, Spain; Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Paterna, Spain; Consorcio de Investigación Biomédica en Red especializado en Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
| | - Jorge Francisco Vázquez-Castellanos
- Área de Genómica y Salud, Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunidad Valenciana (FISABIO)-Salud Pública, Valencia, Spain; Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Paterna, Spain; Consorcio de Investigación Biomédica en Red especializado en Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
| | - Andrés Moya
- Área de Genómica y Salud, Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunidad Valenciana (FISABIO)-Salud Pública, Valencia, Spain; Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Paterna, Spain; Consorcio de Investigación Biomédica en Red especializado en Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
| |
|
108
|
Abstract
Background: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly, owing to the violation of assumptions during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective solution to obtaining haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking. Methods: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning classification is then used to classify sequences in partial bubbles as homologous or non-homologous prior to reconstructing haplotype-specific scaffolds. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups. Results: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads). Conclusion: More work is required to comparatively analyze this approach on real data with various parameters and classifiers against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder lend validity to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.
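The "bubble" structures mentioned in this abstract can be pictured as two alternative contigs that share the same predecessor and the same successor in the scaffold graph. The sketch below finds only this simplest one-contig-per-branch pattern in a toy directed graph. It is an illustrative assumption and not the ScaffoldScaffolder algorithm, whose details the abstract does not give; the function name `find_simple_bubbles` and the contig names are hypothetical.

```python
from itertools import combinations


def find_simple_bubbles(graph):
    """Find simple bubbles in a directed graph.

    graph maps each node to the set of its successor nodes.
    A simple bubble is reported as (source, branch_a, branch_b, sink)
    when source links to both branches and both branches link to sink.
    """
    bubbles = []
    for source, successors in graph.items():
        # Check every unordered pair of branches leaving the same source.
        for a, b in combinations(sorted(successors), 2):
            shared_sinks = graph.get(a, set()) & graph.get(b, set())
            for sink in sorted(shared_sinks):
                bubbles.append((source, a, b, sink))
    return bubbles


# Hypothetical scaffold graph: c2a and c2b are alternative (heterozygous)
# contigs lying between the shared contigs c1 and c3.
scaffold_graph = {
    "c1": {"c2a", "c2b"},
    "c2a": {"c3"},
    "c2b": {"c3"},
    "c3": {"c4"},
    "c4": set(),
}

for source, a, b, sink in find_simple_bubbles(scaffold_graph):
    print(f"bubble: {source} -> ({a} | {b}) -> {sink}")
```

In this picture the two branch contigs are the candidate homologous pair. Partial bubbles, where one side of the structure is missing, are not detected by this sketch; the abstract describes handling those with a trained classifier instead.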
|
109
|
Wang W, Feng B, Xiao J, Xia Z, Zhou X, Li P, Zhang W, Wang Y, Møller BL, Zhang P, Luo MC, Xiao G, Liu J, Yang J, Chen S, Rabinowicz PD, Chen X, Zhang HB, Ceballos H, Lou Q, Zou M, Carvalho LJCB, Zeng C, Xia J, Sun S, Fu Y, Wang H, Lu C, Ruan M, Zhou S, Wu Z, Liu H, Kannangara RM, Jørgensen K, Neale RL, Bonde M, Heinz N, Zhu W, Wang S, Zhang Y, Pan K, Wen M, Ma PA, Li Z, Hu M, Liao W, Hu W, Zhang S, Pei J, Guo A, Guo J, Zhang J, Zhang Z, Ye J, Ou W, Ma Y, Liu X, Tallon LJ, Galens K, Ott S, Huang J, Xue J, An F, Yao Q, Lu X, Fregene M, López-Lavalle LAB, Wu J, You FM, Chen M, Hu S, Wu G, Zhong S, Ling P, Chen Y, Wang Q, Liu G, Liu B, Li K, Peng M. Cassava genome from a wild ancestor to cultivated varieties. Nat Commun 2014; 5:5110. [PMID: 25300236 PMCID: PMC4214410 DOI: 10.1038/ncomms6110] [Citation(s) in RCA: 159] [Impact Index Per Article: 14.5] [Received: 03/20/2014] [Accepted: 08/27/2014] [Indexed: 11/10/2022]
Abstract
Cassava is a major tropical food crop in the Euphorbiaceae family that has high carbohydrate production potential and adaptability to diverse environments. Here we present the draft genome sequences of a wild ancestor and a domesticated variety of cassava and comparative analyses with a partial inbred line. We identify 1,584 and 1,678 gene models specific to the wild and domesticated varieties, respectively, and discover high heterozygosity and millions of single-nucleotide variations. Our analyses reveal that genes involved in photosynthesis, starch accumulation and abiotic stresses have been positively selected, whereas those involved in cell wall biosynthesis and secondary metabolism, including cyanogenic glucoside formation, have been negatively selected in the cultivated varieties, reflecting the result of natural selection and domestication. Differences in microRNA genes and retrotransposon regulation could partly explain an increased carbon flux towards starch accumulation and reduced cyanogenic glucoside accumulation in domesticated cassava. These results may contribute to genetic improvement of cassava through better understanding of its biology.
Affiliation(s)
- Wenquan Wang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Binxiao Feng
- [1] Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China [2] Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Jingfa Xiao
- Beijing Institute of Genomics, Chinese Academy of Sciences (CAS), Beijing 100101, China
| | - Zhiqiang Xia
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Xincheng Zhou
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Pinghua Li
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Weixiong Zhang
- [1] Department of Computer Science and Engineering and Department of Genetics, Washington University, Saint Louis, Missouri 63130, USA [2] Institute for Systems Biology, Jianghan University, Wuhan 430056, China
| | - Ying Wang
- South China Botanical Garden, CAS, Guangzhou 510650, China
| | - Birger Lindberg Møller
- Plant Biochemistry Laboratory, Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen 1165, Denmark
| | - Peng Zhang
- Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences of CAS, Shanghai 200032, China
| | - Ming-Cheng Luo
- Department of Plant Sciences, University of California, Davis, California 95616, USA
| | - Gong Xiao
- South China Botanical Garden, CAS, Guangzhou 510650, China
| | - Jingxing Liu
- Beijing Institute of Genomics, Chinese Academy of Sciences (CAS), Beijing 100101, China
| | - Jun Yang
- Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences of CAS, Shanghai 200032, China
| | - Songbi Chen
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Pablo D Rabinowicz
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
| | - Xin Chen
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Hong-Bin Zhang
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas 77843, USA
| | - Henan Ceballos
- International Center for Tropical Agriculture (CIAT), Cali 6713, Colombia
| | - Qunfeng Lou
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, College of Horticulture, Nanjing Agricultural University, Nanjing 210095, China
| | - Meiling Zou
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Luiz J C B Carvalho
- Brazilian Enterprise for Agricultural Research (EMBRAPA), Genetic Resources and Biotechnology, Brasilia 70770, Brazil
| | - Changying Zeng
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Jing Xia
- [1] Department of Computer Science and Engineering and Department of Genetics, Washington University, Saint Louis, Missouri 63130, USA [2] Institute for Systems Biology, Jianghan University, Wuhan 430056, China
| | - Shixiang Sun
- Beijing Institute of Genomics, Chinese Academy of Sciences (CAS), Beijing 100101, China
| | - Yuhua Fu
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Haiyan Wang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Cheng Lu
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Mengbin Ruan
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Shuigeng Zhou
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200433, China
| | - Zhicheng Wu
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200433, China
| | - Hui Liu
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200433, China
| | - Rubini Maya Kannangara
- Plant Biochemistry Laboratory, Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen 1165, Denmark
| | - Kirsten Jørgensen
- Plant Biochemistry Laboratory, Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen 1165, Denmark
| | - Rebecca Louise Neale
- Plant Biochemistry Laboratory, Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen 1165, Denmark
| | - Maya Bonde
- Plant Biochemistry Laboratory, Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen 1165, Denmark
| | - Nanna Heinz
- Plant Biochemistry Laboratory, Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen 1165, Denmark
| | - Wenli Zhu
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Shujuan Wang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Yang Zhang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Kun Pan
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Mingfu Wen
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Ping-An Ma
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Zhengxu Li
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Meizhen Hu
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Wenbin Liao
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Wenbin Hu
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Shengkui Zhang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Jinli Pei
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Anping Guo
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Jianchun Guo
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Jiaming Zhang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Zhengwen Zhang
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Jianqiu Ye
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Wenjun Ou
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Yaqin Ma
- Department of Plant Sciences, University of California, Davis, California 95616, USA
| | - Xinyue Liu
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
| | - Luke J Tallon
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
| | - Kevin Galens
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
| | - Sandra Ott
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
| | - Jie Huang
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Jingjing Xue
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Feifei An
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Qingqun Yao
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Xiaojing Lu
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Martin Fregene
- International Center for Tropical Agriculture (CIAT), Cali 6713, Colombia
| | | | - Jiajie Wu
- Department of Plant Sciences, University of California, Davis, California 95616, USA
| | - Frank M You
- Department of Plant Sciences, University of California, Davis, California 95616, USA
| | - Meili Chen
- Beijing Institute of Genomics, Chinese Academy of Sciences (CAS), Beijing 100101, China
| | - Songnian Hu
- Beijing Institute of Genomics, Chinese Academy of Sciences (CAS), Beijing 100101, China
| | - Guojiang Wu
- South China Botanical Garden, CAS, Guangzhou 510650, China
| | - Silin Zhong
- State Key Laboratory of Agrobiotechnology, School of Life Sciences, Chinese University of Hong Kong, Hong Kong, China
| | - Peng Ling
- Citrus Research and Education Center (CREC), University of Florida, Gainesville, Florida 32611, USA
| | - Yeyuan Chen
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Qinghuang Wang
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| | - Guodao Liu
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Bin Liu
- State Key Laboratory of Desert and Oasis Ecology, Key Laboratory of Biogeography and Bioresources in Arid Land, Center of Systematic Genomics, Xinjiang Institute of Ecology and Geography, Urumqi 830011, China
| | - Kaimian Li
- Tropical Crop Genetic Resources Institute, CATAS, Danzhou 571700, China
| | - Ming Peng
- Institute of Tropical Biosciences and Biotechnology, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou 571101, China
| |
|
110
|
Jünemann S, Prior K, Albersmeier A, Albaum S, Kalinowski J, Goesmann A, Stoye J, Harmsen D. GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers. PLoS One 2014; 9:e107014. [PMID: 25198770 PMCID: PMC4157817 DOI: 10.1371/journal.pone.0107014] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Accepted: 08/07/2014] [Indexed: 12/28/2022] Open
Abstract
De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), popularized the easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, here, we give a novel insight into the line of assembly evaluation surveys as we benchmark popular de novo genome assemblers based on bacterial data generated by benchtop sequencers. Therefore, single-library assemblies were generated, assembled, and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria as for instance computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn Graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types, but also affect assemblies differently regarding the depth of coverage where oversampling can become problematic. Assemblies vary greatly with respect to contiguity and accuracy but also by the requirement on the computing power. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determines which assembler suits best. 
The data sets, scripts and all additional information needed to replicate our results are freely available at ftp://ftp.cebitec.uni-bielefeld.de/pub/GABenchToB.
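The abstract above compares assemblies by contiguity metrics without naming them. A commonly used contiguity metric is N50; the following is a minimal Python sketch of it, offered as an illustration only. The function name `n50` is hypothetical, and nothing here is taken from the GABenchToB scripts.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly: no contigs at all


if __name__ == "__main__":
    # Total length is 20, so half is 10; 6 + 5 = 11 reaches it at length 5.
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about accuracy, which is why the abstract treats contiguity and accuracy as separate criteria.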
Affiliation(s)
- Sebastian Jünemann
- Department for Periodontology, University of Münster, Münster, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Karola Prior
- Department for Periodontology, University of Münster, Münster, Germany
| | - Andreas Albersmeier
- Technology Platform Genomics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Stefan Albaum
- Bioinformatics Resource Facility, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Jörn Kalinowski
- Technology Platform Genomics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Alexander Goesmann
- Bioinformatics and Systems Biology, Justus-Liebig-University Gießen, Gießen, Germany
| | - Jens Stoye
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Genome Informatics Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Dag Harmsen
- Department for Periodontology, University of Münster, Münster, Germany
| |
|
111
|
Huang KW, Chen JL, Yang CS, Tsai CW. A memetic particle swarm optimization algorithm for solving the DNA fragment assembly problem. Neural Comput Appl 2014. [DOI: 10.1007/s00521-014-1659-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
|
112
|
Sijmons S, Thys K, Corthout M, Van Damme E, Van Loock M, Bollen S, Baguet S, Aerssens J, Van Ranst M, Maes P. A method enabling high-throughput sequencing of human cytomegalovirus complete genomes from clinical isolates. PLoS One 2014; 9:e95501. [PMID: 24755734 PMCID: PMC3995935 DOI: 10.1371/journal.pone.0095501] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2014] [Accepted: 03/26/2014] [Indexed: 12/20/2022] Open
Abstract
Human cytomegalovirus (HCMV) is a ubiquitous virus that can cause serious sequelae in immunocompromised patients and in the developing fetus. The coding capacity of the 235 kbp genome is still incompletely understood, and there is a pressing need to characterize genomic contents in clinical isolates. In this study, a procedure for the high-throughput generation of full genome consensus sequences from clinical HCMV isolates is presented. This method relies on low number passaging of clinical isolates on human fibroblasts, followed by digestion of cellular DNA and purification of viral DNA. After multiple displacement amplification, highly pure viral DNA is generated. These extracts are suitable for high-throughput next-generation sequencing and assembly of consensus sequences. Throughout a series of validation experiments, we showed that the workflow reproducibly generated consensus sequences representative for the virus population present in the original clinical material. Additionally, the performance of 454 GS FLX and/or Illumina Genome Analyzer datasets in consensus sequence deduction was evaluated. Based on assembly performance data, the Illumina Genome Analyzer was the platform of choice in the presented workflow. Analysis of the consensus sequences derived in this study confirmed the presence of gene-disrupting mutations in clinical HCMV isolates independent from in vitro passaging. These mutations were identified in genes RL5A, UL1, UL9, UL111A and UL150. In conclusion, the presented workflow provides opportunities for high-throughput characterization of complete HCMV genomes that could deliver new insights into HCMV coding capacity and genetic determinants of viral tropism and pathogenicity.
Affiliation(s)
- Steven Sijmons
- Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
- * E-mail:
| | - Kim Thys
- Janssen Infectious Diseases BVBA, Beerse, Belgium
| | - Michaël Corthout
- Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
| | | | | | - Stefanie Bollen
- Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
| | - Sylvie Baguet
- Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
| | | | - Marc Van Ranst
- Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
| | - Piet Maes
- Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
| |
|
113
|
Wang C, Grohme MA, Mali B, Schill RO, Frohme M. Towards decrypting cryptobiosis--analyzing anhydrobiosis in the tardigrade Milnesium tardigradum using transcriptome sequencing. PLoS One 2014; 9:e92663. [PMID: 24651535 PMCID: PMC3961413 DOI: 10.1371/journal.pone.0092663] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2013] [Accepted: 02/25/2014] [Indexed: 11/18/2022] Open
Abstract
Background Many tardigrade species are capable of anhydrobiosis; however, mechanisms underlying their extreme desiccation resistance remain elusive. This study attempts to quantify the anhydrobiotic transcriptome of the limno-terrestrial tardigrade Milnesium tardigradum. Results A prerequisite for differential gene expression analysis was the generation of a reference hybrid transcriptome atlas by assembly of Sanger, 454 and Illumina sequence data. The final assembly yielded 79,064 contigs (>100 bp) after removal of ribosomal RNAs. Around 50% of them could be annotated by SwissProt and NCBI non-redundant protein sequences. Analysis using CEGMA predicted 232 (93.5%) out of the 248 highly conserved eukaryotic genes in the assembly. We used this reference transcriptome for mapping and quantifying the expression of transcripts regulated under anhydrobiosis in a time-series during dehydration and rehydration. 834 of the transcripts were found to be differentially expressed in a single stage (dehydration/inactive tun/rehydration) and 184 were overlapping in two stages while 74 were differentially expressed in all three stages. We have found interesting patterns of differentially expressed transcripts that are in concordance with a common hypothesis of metabolic shutdown during anhydrobiosis. This included down-regulation of several proteins of the DNA replication and translational machinery and protein degradation. Among others, heat shock proteins Hsp27 and Hsp30c were up-regulated in response to dehydration and rehydration. In addition, we observed up-regulation of polyubiquitin-B upon rehydration together with a higher expression level of several DNA repair proteins during rehydration than in the dehydration stage. Conclusions Most of the transcripts identified to be differentially expressed had distinct cellular function. Our data suggest a concerted molecular adaptation in M. tardigradum that permits extreme forms of ametabolic states such as anhydrobiosis. It is tempting to surmise that the desiccation tolerance of tardigrades can be achieved by a constitutive cellular protection system, probably in conjunction with other mechanisms such as rehydration-induced cellular repair.
Affiliation(s)
- Chong Wang
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- * E-mail:
| | - Markus A. Grohme
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | - Brahim Mali
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | - Ralph O. Schill
- Biological Institute, Zoology, University of Stuttgart, Stuttgart, Germany
| | - Marcus Frohme
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| |
|
114
|
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. Here we discuss them as a framework of four stages for data analysis and processing and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that face current assemblers in the next-generation environment to determine the current state-of-the-art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
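The four stages named in this abstract can be illustrated with a toy de Bruijn graph assembler. The Python sketch below is an assumption-laden illustration, not the framework of any assembler surveyed in the review: all function names (`filter_reads`, `build_graph`, `compact_paths`, `assemble`) are hypothetical, reads are assumed error-free and single-stranded, and isolated cycles in the graph are ignored.

```python
from collections import defaultdict


def filter_reads(reads, k):
    """Stage 1 (preprocessing): keep reads of length >= k made only of A/C/G/T."""
    return [r for r in reads if len(r) >= k and set(r) <= set("ACGT")]


def build_graph(reads, k):
    """Stage 2 (graph construction): nodes are (k-1)-mers, each distinct k-mer is an edge."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def compact_paths(graph):
    """Stage 3 (simplification): merge non-branching paths into contigs."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    nodes = set(graph) | set(indegree)

    def is_inner(node):
        # An inner node has exactly one incoming and one outgoing edge.
        return indegree.get(node, 0) == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for start in nodes:
        if is_inner(start):
            continue  # paths only start at branching or terminal nodes
        for node in graph.get(start, ()):
            contig = start + node[-1]
            while is_inner(node):
                node = next(iter(graph[node]))
                contig += node[-1]
            contigs.append(contig)
    return contigs


def assemble(reads, k, min_len):
    """Run all four stages; stage 4 (postprocessing) drops contigs shorter than min_len."""
    graph = build_graph(filter_reads(reads, k), k)
    return sorted(c for c in compact_paths(graph) if len(c) >= min_len)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "ATGNCGT", "ACG"]
    print(assemble(reads, k=4, min_len=5))  # ['ATGGCGTACC']
```

In the usage example, the read containing `N` and the read shorter than k are removed in stage 1, and the two remaining reads overlap enough to be merged into a single contig.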
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
| |
|
115
|
Nijkamp JF, Pop M, Reinders MJT, de Ridder D. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 2013; 29:2826-34. [PMID: 24058058 DOI: 10.1093/bioinformatics/btt502] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION Although many tools are available to study variation and its impact in single genomes, there is a lack of algorithms for finding such variation in metagenomes. This hampers the interpretation of metagenomics sequencing datasets, which are increasingly acquired in research on the (human) microbiome, in environmental studies and in the study of processes in the production of foods and beverages. Existing algorithms often depend on the use of reference genomes, which pose a problem when a metagenome of a priori unknown strain composition is studied. In this article, we develop a method to perform reference-free detection and visual exploration of genomic variation, both within a single metagenome and between metagenomes. RESULTS We present the MaryGold algorithm and its implementation, which efficiently detects bubble structures in contig graphs using graph decomposition. These bubbles represent variable genomic regions in closely related strains in metagenomic samples. The variation found is presented in a condensed Circos-based visualization, which allows for easy exploration and interpretation of the found variation. We validated the algorithm on two simulated datasets containing three and seven Escherichia coli genomes, respectively, and showed that finding allelic variation in these genomes improves assemblies. Additionally, we applied MaryGold to publicly available real metagenomic datasets, enabling us to find within-sample genomic variation in the metagenomes of a kimchi fermentation process, the microbiome of a premature infant and in microbial communities living on acid mine drainage. Moreover, we used MaryGold for between-sample variation detection and exploration by comparing sequencing data sampled at different time points for both of these datasets. AVAILABILITY MaryGold has been written in C++ and Python and can be downloaded from http://bioinformatics.tudelft.nl/software
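To make the notion of a bubble in a contig graph concrete, the Python sketch below finds only the simplest kind: a source contig whose branches each consist of a single contig and reconverge on one sink contig. This is a deliberately simplified stand-in and not the graph decomposition used by MaryGold; the function name `find_simple_bubbles` and the dictionary-of-sets graph representation are assumptions made for illustration.

```python
def find_simple_bubbles(graph):
    """Find simple bubbles in a directed contig graph.

    graph maps each contig id to the set of its successor contig ids.
    A simple bubble is a source with two or more branches, where every
    branch has the source as its only predecessor and exactly one
    successor, and all branches share that same successor (the sink).
    """
    predecessors = {}
    for node, successors in graph.items():
        for succ in successors:
            predecessors.setdefault(succ, set()).add(node)

    bubbles = []
    for source, branches in graph.items():
        if len(branches) < 2:
            continue
        sinks = set()
        is_bubble = True
        for branch in branches:
            # Each branch must be entered only from the source and leave by one edge.
            if predecessors.get(branch) != {source} or len(graph.get(branch, ())) != 1:
                is_bubble = False
                break
            sinks |= graph[branch]
        if is_bubble and len(sinks) == 1:
            bubbles.append((source, sorted(branches), next(iter(sinks))))
    return bubbles


if __name__ == "__main__":
    contig_graph = {
        "c1": {"c2", "c3"},  # c2 and c3 are alternative alleles
        "c2": {"c4"},
        "c3": {"c4"},
        "c4": {"c5"},
        "c5": set(),
    }
    print(find_simple_bubbles(contig_graph))  # [('c1', ['c2', 'c3'], 'c4')]
```

In a metagenomic setting, the two branches would correspond to a variable region that differs between closely related strains, while the source and sink are shared sequence.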
Affiliation(s)
- Jurgen F Nijkamp
- Department of Intelligent Systems, The Delft Bioinformatics Lab, Delft University of Technology, 2628 CD Delft, The Netherlands, Kluyver Centre for Genomics of Industrial Fermentation, 2600 GA Delft, The Netherlands and Department of Computer Science, Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| | | | | | | |
|
116
|
Links MG, Chaban B, Hemmingsen SM, Muirhead K, Hill JE. mPUMA: a computational approach to microbiota analysis by de novo assembly of operational taxonomic units based on protein-coding barcode sequences. MICROBIOME 2013; 1:23. [PMID: 24451012 PMCID: PMC3971603 DOI: 10.1186/2049-2618-1-23] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/08/2013] [Accepted: 08/03/2013] [Indexed: 05/03/2023]
Abstract
BACKGROUND Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. RESULTS Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. CONCLUSIONS mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.
Affiliation(s)
- Matthew G Links
- Agriculture and AgriFood Canada, 107 Science Place, S7N 0X2, Saskatoon, SK, Canada
- Department of Veterinary Microbiology, University of Saskatchewan, 52 Campus Drive, S7N 5B4, Saskatoon, SK, Canada
| | - Bonnie Chaban
- Department of Veterinary Microbiology, University of Saskatchewan, 52 Campus Drive, S7N 5B4, Saskatoon, SK, Canada
| | - Sean M Hemmingsen
- National Research Council Canada, 110 Gymnasium Place, S7N 0W9, Saskatoon, SK, Canada
- Department of Microbiology & Immunology, University of Saskatchewan, 107 Wiggins Road, S7N 5E5, Saskatoon, SK, Canada
| | - Kevin Muirhead
- National Research Council Canada, 110 Gymnasium Place, S7N 0W9, Saskatoon, SK, Canada
| | - Janet E Hill
- Agriculture and AgriFood Canada, 107 Science Place, S7N 0X2, Saskatoon, SK, Canada
| |
|
117
|
Ruttink T, Sterck L, Rohde A, Bendixen C, Rouzé P, Asp T, Van de Peer Y, Roldan-Ruiz I. Orthology Guided Assembly in highly heterozygous crops: creating a reference transcriptome to uncover genetic diversity in Lolium perenne. PLANT BIOTECHNOLOGY JOURNAL 2013; 11:605-17. [PMID: 23433242 DOI: 10.1111/pbi.12051] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/16/2012] [Revised: 01/05/2013] [Accepted: 01/11/2013] [Indexed: 05/09/2023]
Abstract
Despite current advances in next-generation sequencing data analysis procedures, de novo assembly of a reference sequence required for SNP discovery and expression analysis is still a major challenge in genetically uncharacterized, highly heterozygous species. High levels of polymorphism inherent to outbreeding crop species hamper De Bruijn Graph-based de novo assembly algorithms, causing transcript fragmentation and the redundant assembly of allelic contigs. If multiple genotypes are sequenced to study genetic diversity, primary de novo assembly is best performed per genotype to limit the level of polymorphism and avoid transcript fragmentation. Here, we propose an Orthology Guided Assembly procedure that first uses sequence similarity (tBLASTn) to proteins of a model species to select allelic and fragmented contigs from all genotypes and then performs CAP3 clustering on a gene-by-gene basis. Thus, we simultaneously annotate putative orthologues for each protein of the model species, resolve allelic redundancy and fragmentation and create a de novo transcript sequence representing the consensus of all alleles present in the sequenced genotypes. We demonstrate the procedure using RNA-seq data from 14 genotypes of Lolium perenne to generate a reference transcriptome for gene discovery and translational research, to reveal the transcriptome-wide distribution and density of SNPs in an outbreeding crop and to illustrate the effect of polymorphisms on the assembly procedure. The results presented here illustrate that constructing a non-redundant reference sequence is essential for comparative genomics, orthology-based annotation and candidate gene selection but also for read mapping and subsequent polymorphism discovery and/or read count-based gene expression analysis.
Affiliation(s)
- Tom Ruttink
- Plant Sciences Unit--Growth and Development, Institute for Agricultural and Fisheries Research-ILVO, Melle, Belgium.
| | | | | | | | | | | | | | | |
|
118
|
Nakasugi K, Crowhurst RN, Bally J, Wood CC, Hellens RP, Waterhouse PM. De novo transcriptome sequence assembly and analysis of RNA silencing genes of Nicotiana benthamiana. PLoS One 2013; 8:e59534. [PMID: 23555698 PMCID: PMC3610648 DOI: 10.1371/journal.pone.0059534] [Citation(s) in RCA: 143] [Impact Index Per Article: 11.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2012] [Accepted: 02/15/2013] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Nicotiana benthamiana has been widely used for transient gene expression assays and as a model plant in the study of plant-microbe interactions, lipid engineering and RNA silencing pathways. Assembling the sequence of its transcriptome provides information that, in conjunction with the genome sequence, will facilitate gaining insight into the plant's capacity for high-level transient transgene expression, generation of mobile gene silencing signals, and hyper-susceptibility to viral infection. METHODOLOGY/RESULTS RNA-seq libraries from 9 different tissues were deep sequenced and assembled, de novo, into a representation of the transcriptome. The assembly, of 16GB of sequence, yielded 237,340 contigs, clustering into 119,014 transcripts (unigenes). Between 80 and 85% of reads from all tissues could be mapped back to the full transcriptome. Approximately 63% of the unigenes exhibited a match to the Solgenomics tomato predicted proteins database. Approximately 94% of the Solgenomics N. benthamiana unigene set (16,024 sequences) matched our unigene set (119,014 sequences). Using homology searches we identified 31 homologues that are involved in RNAi-associated pathways in Arabidopsis thaliana, and show that they possess the domains characteristic of these proteins. Of these genes, the RNA dependent RNA polymerase gene, Rdr1, is transcribed but has a 72 nt insertion in exon1 that would cause premature termination of translation. Dicer-like 3 (DCL3) appears to lack both the DEAD helicase motif and second dsRNA binding motif, and DCL2 and AGO4b have unexpectedly high levels of transcription. CONCLUSIONS The assembled and annotated representation of the transcriptome and list of RNAi-associated sequences are accessible at www.benthgenome.com alongside a draft genome assembly. These genomic resources will be very useful for further study of the developmental, metabolic and defense pathways of N. benthamiana and in understanding the mechanisms behind the features which have made it such a well-used model plant.
Affiliation(s)
- Kenlee Nakasugi
- School of Molecular Bioscience, University of Sydney, Sydney, Australia
| | - Ross N. Crowhurst
- Mount Albert Research Centre, Plant and Food Research, Auckland, New Zealand
| | - Julia Bally
- School of Molecular Bioscience, University of Sydney, Sydney, Australia
| | - Craig C. Wood
- Commonwealth Scientific and Industrial Research Organisation–Plant Industry, Canberra, Australia
| | - Roger P. Hellens
- Mount Albert Research Centre, Plant and Food Research, Auckland, New Zealand
| | | |
|
119
|
CLARKE K, YANG Y, MARSH R, XIE L, ZHANG KK. Comparative analysis of de novo transcriptome assembly. SCIENCE CHINA-LIFE SCIENCES 2013; 56:156-62. [PMID: 23393031 PMCID: PMC5778448 DOI: 10.1007/s11427-013-4444-x] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/07/2012] [Accepted: 12/28/2012] [Indexed: 12/03/2022]
Abstract
The fast development of next-generation sequencing technology presents a major computational challenge for data processing and analysis. A fast algorithm, the de Bruijn graph, has been successfully used for genome DNA de novo assembly; nevertheless, its performance for transcriptome assembly is unclear. In this study, we used both simulated and real RNA-Seq data, from either artificial RNA templates or human transcripts, to evaluate five de novo assemblers, ABySS, Mira, Trinity, Velvet and Oases. Of these assemblers, ABySS, Trinity, Velvet and Oases are all based on the de Bruijn graph, and Mira uses an overlap graph algorithm. Various numbers of RNA short reads were selected from the External RNA Control Consortium (ERCC) data and human chromosome 22. A number of statistics were then calculated for the resulting contigs from each assembler. Each experiment was repeated multiple times to obtain the mean statistics and standard error estimate. Trinity had relatively good performance for both ERCC and human data, but it may not consistently generate full-length transcripts. ABySS was the fastest method but its assembly quality was low. Mira gave a good rate for mapping its contigs onto human chromosome 22, but its computational speed is not satisfactory. Our results suggest that transcript assembly remains a challenging problem for the bioinformatics community. Therefore, a novel assembler is needed for assembling transcriptome data generated by next-generation sequencing techniques.
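The abstract contrasts de Bruijn graph assemblers with an overlap graph assembler. As a rough illustration of the overlap graph idea only, the Python sketch below links two reads when a suffix of one exactly matches a prefix of the other. It is a naive all-pairs version under the assumption of error-free reads; it is not how Mira or any evaluated assembler is implemented, and the names `overlap` and `overlap_graph` are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len exists.
    min_len is assumed to be at least 1.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len):
    """Map each ordered read pair (a, b) to its overlap length, if long enough."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            edges[(a, b)] = n
    return edges


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC"]
    print(overlap_graph(reads, min_len=3))  # {('ATGGCGT', 'GCGTACC'): 4}
```

The all-pairs comparison grows quadratically with the number of reads, which is one plausible reason overlap-based approaches can be slower on short-read data than k-mer-based graph construction.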
Affiliation(s)
- Kaitlin CLARKE
  - Bioinformatics Core, Department of Pathology, University of North Dakota, Grand Forks, ND 58202, USA
- Yi YANG
  - Bioinformatics Core, Department of Pathology, University of North Dakota, Grand Forks, ND 58202, USA
  - Department of Computer Science, University of North Dakota, Grand Forks, ND 58202, USA
- Ronald MARSH
  - Department of Computer Science, University of North Dakota, Grand Forks, ND 58202, USA
- LingLin XIE
  - Department of Biochemistry and Molecular Biology, University of North Dakota, Grand Forks, ND 58202, USA
- Ke K. ZHANG
  - Bioinformatics Core, Department of Pathology, University of North Dakota, Grand Forks, ND 58202, USA
  - Corresponding author
|
120
|
Dutilh BE, Schmieder R, Nulton J, Felts B, Salamon P, Edwards RA, Mokili JL. Reference-independent comparative metagenomics using cross-assembly: crAss. Bioinformatics 2012; 28:3225-31. [PMID: 23074261 PMCID: PMC3519457 DOI: 10.1093/bioinformatics/bts613] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Indexed: 12/21/2022]
Abstract
MOTIVATION: Metagenomes are often characterized by high levels of unknown sequences. Reads derived from known microorganisms can easily be identified and analyzed using fast homology search algorithms and a suitable reference database, but the unknown sequences are often ignored in further analyses, biasing conclusions. Nevertheless, it is possible to use more data in a comparative metagenomic analysis by creating a cross-assembly of all reads, i.e. a single assembly of reads from different samples. Comparative metagenomics studies the interrelationships between metagenomes from different samples. Using an assembly algorithm is a fast and intuitive way to link (partially) homologous reads without requiring a database of reference sequences. RESULTS: Here, we introduce crAss, a novel bioinformatic tool that enables fast, simple analysis of cross-assembly files, yielding distances between all metagenomic sample pairs and an insightful image displaying the similarities.
Affiliation(s)
- Bas E Dutilh
- Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Radboud University Medical Centre, 6525 GA Nijmegen, The Netherlands.
|
121
|
Liu B, Yuan J, Yiu SM, Li Z, Xie Y, Chen Y, Shi Y, Zhang H, Li Y, Lam TW, Luo R. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 2012; 28:2870-4. [PMID: 23044551 DOI: 10.1093/bioinformatics/bts563] [Citation(s) in RCA: 111] [Impact Index Per Article: 8.5] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION: The boost of next-generation sequencing technologies provides us with an unprecedented opportunity for elucidating genetic mysteries, yet the short read length hinders us from better assembling the genome from scratch. New protocols now exist that can generate overlapping pair-end reads. By joining the 3' ends of each read pair, one is able to construct longer reads for assembling. However, effectively joining two overlapped pair-end reads remains a challenging task. RESULTS: In this article, we present an efficient tool called Connecting Overlapped Pair-End (COPE) reads, to connect overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10 and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads. AVAILABILITY AND IMPLEMENTATION: COPE is implemented in C++ and is freely available as open-source code at ftp://ftp.genomics.org.cn/pub/cope. CONTACT: twlam@cs.hku.hk or luoruibang@genomics.org.cn
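The idea of joining an overlapping read pair at its 3' ends can be illustrated with a minimal sketch. This is not COPE's k-mer-frequency method: it is a plain longest-overlap search with a mismatch tolerance, and the function names `reverse_complement` and `merge_pair` and the parameters `min_overlap` and `max_mismatch_rate` are hypothetical choices made only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_rate: float = 0.1):
    """Join a read pair whose 3' ends overlap; return None if no overlap fits.

    Assumption (illustrative only): the longest overlap whose mismatch
    fraction stays within max_mismatch_rate is accepted. COPE itself
    resolves ambiguous overlaps with k-mer frequencies, which this
    sketch does not attempt.
    """
    # Read 2 comes from the opposite strand, so orient it like read 1.
    r2 = reverse_complement(read2)
    # Try the longest candidate overlap first, then shorter ones.
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-olen:], r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * olen:
            return read1 + r2[olen:]
    return None


# Usage: a 30 bp fragment sequenced as two 20 bp reads overlapping by 10 bp.
fragment = "ACGTTGCAAGGCTTACGATCGGATCCATGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2)
```

Taking the longest acceptable overlap is a simplification; in repetitive sequence several overlap lengths can look plausible, which is the ambiguity a frequency-based approach is meant to resolve.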
Affiliation(s)
- Binghang Liu
- HKU-BGI BAL-Bioinformatics Algorithms and Core Technology Research Laboratory, The University of Hong Kong, Hong Kong
|
122
|
Logares R, Haverkamp TH, Kumar S, Lanzén A, Nederbragt AJ, Quince C, Kauserud H. Environmental microbiology through the lens of high-throughput DNA sequencing: Synopsis of current platforms and bioinformatics approaches. J Microbiol Methods 2012; 91:106-13. [DOI: 10.1016/j.mimet.2012.07.017] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Received: 06/10/2012] [Revised: 07/19/2012] [Accepted: 07/23/2012] [Indexed: 10/28/2022]
|
123
|
Why assembling plant genome sequences is so challenging. Biology 2012; 1:439-59. [PMID: 24832233 PMCID: PMC4009782 DOI: 10.3390/biology1020439] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Received: 07/16/2012] [Revised: 09/05/2012] [Accepted: 09/06/2012] [Indexed: 12/16/2022]
Abstract
In spite of the biological and economic importance of plants, relatively few plant species have been sequenced. Only the genome sequence of plants with relatively small genomes, most of them angiosperms, in particular eudicots, has been determined. The arrival of next-generation sequencing technologies has allowed the rapid and efficient development of new genomic resources for non-model or orphan plant species. But the sequencing pace of plants is far from that of animals and microorganisms. This review focuses on the typical challenges of plant genomes that can explain why plant genomics is less developed than animal genomics. Explanations about the impact of some confounding factors emerging from the nature of plant genomes are given. As a result of these challenges and confounding factors, the correct assembly and annotation of plant genomes is hindered, genome drafts are produced, and advances in plant genomics are delayed.
|
124
|
An efficient algorithm for DNA fragment assembly in MapReduce. Biochem Biophys Res Commun 2012; 426:395-8. [PMID: 22960169 DOI: 10.1016/j.bbrc.2012.08.101] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 08/10/2012] [Accepted: 08/21/2012] [Indexed: 11/22/2022]
Abstract
Fragment assembly is one of the most important problems in sequence assembly. Algorithms for DNA fragment assembly using the de Bruijn graph have been widely used. These algorithms require a large amount of memory and running time to build the de Bruijn graph. Another drawback of the conventional de Bruijn approach is the loss of information. To overcome these shortcomings, this paper proposes a parallel strategy to construct the de Bruijn graph. Its main characteristic is that it avoids dividing the de Bruijn graph. A novel fragment assembly algorithm based on our parallel strategy is implemented in the MapReduce framework. The experimental results show that the parallel strategy can effectively improve the computational efficiency and remove the memory limitations of the assembly algorithm based on Euler superpath. This paper offers a useful step toward assembling large-scale genome sequences using cloud computing.
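To make the map and reduce roles in de Bruijn graph construction concrete, here is a minimal single-process sketch. It is a generic illustration and not the parallel strategy or the Euler-superpath assembler described in the abstract; the function names `map_read`, `reduce_edges` and `build_de_bruijn` are hypothetical, and a real MapReduce job would distribute the two steps across workers.

```python
from collections import defaultdict
from itertools import chain


def map_read(read: str, k: int):
    """Map step: emit one (prefix, suffix) edge of (k-1)-mers per k-mer."""
    return [(read[i:i + k - 1], read[i + 1:i + k])
            for i in range(len(read) - k + 1)]


def reduce_edges(pairs):
    """Reduce step: group edges by source node and count their multiplicity."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in pairs:
        graph[src][dst] += 1
    return {src: dict(dsts) for src, dsts in graph.items()}


def build_de_bruijn(reads, k: int):
    """Run the map step on every read, then reduce all emitted edges."""
    return reduce_edges(chain.from_iterable(map_read(r, k) for r in reads))


# Usage: two overlapping reads with k = 3 (nodes are 2-mers).
graph = build_de_bruijn(["ACGTA", "CGTAC"], k=3)
```

Because each read is mapped independently and edges are grouped only by their source node, the work partitions naturally by read in the map step and by node in the reduce step.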
|