51
|
Sharpe RM, Williamson-Benavides B, Edwards GE, Dhingra A. Methods of analysis of chloroplast genomes of C3, Kranz type C4 and Single Cell C4 photosynthetic members of Chenopodiaceae. PLANT METHODS 2020; 16:119. [PMID: 32874195 PMCID: PMC7457496 DOI: 10.1186/s13007-020-00662-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/07/2020] [Accepted: 08/20/2020] [Indexed: 06/11/2023]
Abstract
BACKGROUND: Chloroplast genome information is critical to understanding forms of photosynthesis in the plant kingdom. During the evolutionary process, plants have developed different photosynthetic strategies that are accompanied by complementary biochemical and anatomical features. Members of family Chenopodiaceae have species with C3 photosynthesis, and variations of C4 photosynthesis in which photorespiration is reduced by concentrating CO2 around Rubisco through dual coordinated functioning of dimorphic chloroplasts. Among dicots, the family has the largest number of C4 species, and the greatest structural and biochemical diversity in forms of C4, including the canonical dual-cell Kranz anatomy and the recently identified single cell C4 with dimorphic chloroplasts separated by a vacuole. This is the first comparative analysis of chloroplast genomes in species representative of photosynthetic types in the family. RESULTS: High throughput sequencing complemented with Sanger sequencing of selected loci provided high quality and complete chloroplast genomes of seven species in the family and one species in the closely related Amaranthaceae family, representing C3, Kranz type C4 and single cell C4 (SSC4) photosynthesis. Six of the eight chloroplast genomes are new, while two are improved versions of previously published genomes. The depth of coverage obtained using high-throughput sequencing complemented with targeted resequencing of certain loci enabled superior resolution of the border junctions, directionality and repeat region sequences. Comparison of the chloroplast genomes with previously sequenced plastid genomes revealed similar genome organization, gene order and content with a few revisions. High-quality complete chloroplast genome sequences resulted in correcting the orientation of the LSC region of the published Bienertia sinuspersici chloroplast genome, identification of stop codons in the rpl23 gene in B. sinuspersici and B. cycloptera, and identification of an instance of IR expansion in the Haloxylon ammodendron inverted repeat sequence. A rare mitochondria-to-chloroplast inter-organellar gene transfer event was identified in family Chenopodiaceae. CONCLUSIONS: This study reports complete chloroplast genomes from seven Chenopodiaceae and one Amaranthaceae species. The depth of coverage obtained using high-throughput sequencing complemented with targeted resequencing of certain loci enabled superior resolution of the border junctions, directionality, and repeat region sequences. Therefore, the hybrid use of high-throughput and Sanger sequencing is reaffirmed as a rapid, efficient, and reliable method for chloroplast genome sequencing.
Affiliation(s)
- Richard M. Sharpe
  - Department of Horticulture, Washington State University, Pullman, WA 99164 USA
- Bruce Williamson-Benavides
  - Department of Horticulture, Washington State University, Pullman, WA 99164 USA
  - Molecular Plants Sciences, Washington State University, Pullman, WA 99164 USA
- Gerald E. Edwards
  - Molecular Plants Sciences, Washington State University, Pullman, WA 99164 USA
  - School of Biological Sciences, Washington State University, Pullman, WA 99164 USA
- Amit Dhingra
  - Department of Horticulture, Washington State University, Pullman, WA 99164 USA
  - Molecular Plants Sciences, Washington State University, Pullman, WA 99164 USA
|
52
|
Yuan Y, Chung CYL, Chan TF. Advances in optical mapping for genomic research. Comput Struct Biotechnol J 2020; 18:2051-2062. [PMID: 32802277 PMCID: PMC7419273 DOI: 10.1016/j.csbj.2020.07.018] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 05/10/2020] [Revised: 07/08/2020] [Accepted: 07/24/2020] [Indexed: 12/28/2022]
Abstract
Recent advances in optical mapping have allowed the construction of improved genome assemblies with greater contiguity. Optical mapping also enables genome comparison and identification of large-scale structural variations. Association of these large-scale genomic features with biological functions is an important goal in plant and animal breeding and in medical research. Optical mapping has also been used in microbiology and still plays an important role in strain typing and epidemiological studies. Here, we review the development of optical mapping in recent decades to illustrate its importance in genomic research. We detail its applications and algorithms to show its specific advantages. Finally, we discuss the challenges that must be addressed to optimize optical mapping and to improve its future development and application.
Key Words
- 3D, three-dimensional
- DBG, de Bruijn graph
- DLS, direct label and stain
- DNA, deoxyribonucleic acid
- Genome assembly
- Hi-C, high-throughput chromosome conformation capture
- Mb, million base pair
- Next generation sequencing
- OLC, overlap-layout-consensus
- Optical mapping
- PCR, polymerase chain reaction
- PacBio, Pacific Biosciences
- SRS, short-read sequencing
- SV, structural variation
- Structural variation
- bp, base pair
- kb, kilobase pair
Affiliation(s)
- Yuxuan Yuan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
- Claire Yik-Lok Chung
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
|
53
|
Pereira R, Oliveira J, Sousa M. Bioinformatics and Computational Tools for Next-Generation Sequencing Analysis in Clinical Genetics. J Clin Med 2020; 9:E132. [PMID: 31947757 PMCID: PMC7019349 DOI: 10.3390/jcm9010132] [Citation(s) in RCA: 94] [Impact Index Per Article: 23.5] [Received: 11/18/2019] [Revised: 12/15/2019] [Accepted: 12/30/2019] [Indexed: 12/13/2022]
Abstract
Clinical genetics has an important role in the healthcare system in providing a definitive diagnosis for many rare syndromes. It can also influence genetic prevention and disease prognosis, and assist in selecting the best options of care/treatment for patients. Next-generation sequencing (NGS) has transformed clinical genetics, making it possible to analyze hundreds of genes at an unprecedented speed and at a lower price compared with conventional Sanger sequencing. Despite the growing literature concerning NGS in a clinical setting, this review aims to fill the gap that exists among (bio)informaticians, molecular geneticists and clinicians, by presenting a general overview of the NGS technology and workflow. First, we will review the current NGS platforms, focusing on the two main platforms Illumina and Ion Torrent, and discussing the major strong points and weaknesses intrinsic to each platform. Next, the NGS analytical bioinformatic pipelines are dissected, giving some emphasis to the algorithms commonly used to generate process data and to analyze sequence variants. Finally, the main challenges around NGS bioinformatics are placed in perspective for future developments. Even with the huge achievements made in NGS technology and bioinformatics, further improvements in bioinformatic algorithms are still required to deal with complex and genetically heterogeneous disorders.
Affiliation(s)
- Rute Pereira
  - Laboratory of Cell Biology, Department of Microscopy, Institute of Biomedical Sciences Abel Salazar (ICBAS), University of Porto (UP), 4050-313 Porto, Portugal
  - Biology and Genetics of Reproduction Unit, Multidisciplinary Unit for Biomedical Research (UMIB), ICBAS-UP, 4050-313 Porto, Portugal
- Jorge Oliveira
  - Biology and Genetics of Reproduction Unit, Multidisciplinary Unit for Biomedical Research (UMIB), ICBAS-UP, 4050-313 Porto, Portugal
  - UnIGENe and CGPP–Centre for Predictive and Preventive Genetics-Institute for Molecular and Cell Biology (IBMC), i3S-Institute for Research and Innovation in Health-UP, 4200-135 Porto, Portugal
- Mário Sousa
  - Laboratory of Cell Biology, Department of Microscopy, Institute of Biomedical Sciences Abel Salazar (ICBAS), University of Porto (UP), 4050-313 Porto, Portugal
  - Biology and Genetics of Reproduction Unit, Multidisciplinary Unit for Biomedical Research (UMIB), ICBAS-UP, 4050-313 Porto, Portugal
|
54
|
Bayat A, Deshpande NP, Wilkins MR, Parameswaran S. Fast Short Read De-Novo Assembly Using Overlap-Layout-Consensus Approach. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:334-338. [PMID: 30307874 DOI: 10.1109/tcbb.2018.2875479] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
De-novo genome assembly is a challenging computational problem for which several pipelines have been developed. The advent of long-read sequencing technology has resulted in a new set of algorithmic approaches for the assembly process. In this work, we identify that one of these new and fast long-read assembly techniques (using Minimap2 and Miniasm) can be modified for the short-read assembly process. This possibility motivated us to customize a long-read assembly approach for applications in a short-read assembly scenario. Here, we compare and contrast our proposed de-novo assembly pipeline (MiniSR) with three other recently developed programs for the assembly of bacterial and small eukaryotic genomes. We have documented two trade-offs: one between speed and accuracy and the other between contiguity and base-calling errors. Our proposed assembly pipeline shows a good balance in these trade-offs. The resulting pipeline is 6 and 2.2 times faster than the short-read assemblers SPAdes and SGA, respectively. MiniSR generates assemblies of superior N50 and NGA50 to SGA, although assemblies are less complete and accurate than those from SPAdes. A third tool, SOAPdenovo2, is as fast as our proposed pipeline but had poorer assembly quality.
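Note on the contiguity metric: the abstract above ranks assemblers by N50. As a reminder of what that number measures, the following is a minimal sketch, not code from MiniSR; the function name `n50` and the contig lengths are hypothetical and chosen only for illustration. NGA50 additionally requires aligning contigs to a reference genome and is not shown.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (total 290 bases, half is 145).
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

With these toy lengths the two longest contigs (80 + 70 = 150) already pass the halfway mark, so the N50 is the length of the second contig.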
|
55
|
Lee DH. Complete Genome Sequencing of Influenza A Viruses Using Next-Generation Sequencing. Methods Mol Biol 2020; 2123:69-79. [PMID: 32170681 DOI: 10.1007/978-1-0716-0346-8_6] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 12/12/2022]
Abstract
Recently, chain termination sequencing methods have been replaced by more efficient next-generation sequencing (NGS) methods. For influenza A, NGS allows for deep sequencing to characterize virus populations, efficient complete genome sequencing, and a non-sequence-dependent method to identify viral variants. There are numerous approaches to preparing samples for NGS and subsequent data processing methods that can be applied to influenza A sequencing. This chapter provides a brief overview of the process of NGS for influenza A and some useful bioinformatics tools for developing an NGS workflow for influenza A viruses.
Affiliation(s)
- Dong-Hun Lee
  - Department of Pathobiology and Veterinary Science, College of Agriculture, Health and Natural Resources, The University of Connecticut, Storrs, CT, USA
|
56
|
Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. QUANTITATIVE BIOLOGY 2019. [DOI: 10.1007/s40484-019-0181-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 11/25/2022]
|
57
|
Liu Z, Dong W, Luo W, Jiang W, Li Q, He Z. HLMethy: a machine learning-based model to identify the hidden labels of m6A candidates. PLANT MOLECULAR BIOLOGY 2019; 101:575-584. [PMID: 31722090 DOI: 10.1007/s11103-019-00930-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/22/2019] [Accepted: 11/01/2019] [Indexed: 06/10/2023]
Abstract
We developed a machine learning-based model to identify the hidden labels of m6A candidates from noisy m6A-seq data. Peak-calling approaches, such as MeRIP-seq or m6A-seq, are commonly used to map m6A modifications. However, these technologies can only map m6A sites with 100-200 nt resolution and cannot reveal the precise location or the number of modified residues in a transcript. To address this challenge, we developed a novel machine learning-based approach, named HLMethy, to assign labels to m6A candidates from noisy m6A-seq data. The multiple instance learning framework was adopted and two different training strategies were used to generate the classification model. To test the performance of our model, m6A sites with single-base resolution were used, and our model achieved performance comparable to existing instance-level predictors, which suggests that our model has the potential to improve the data quality of m6A-seq at reduced costs. Moreover, our generic framework can be extended to other newly discovered modifications that are detected by peak-calling approaches. The source code of HLMethy is available at https://github.com/liuze-nwafu/HLMethy.
Affiliation(s)
- Ze Liu
  - College of Water Resources and Architectural Engineering, Northwest A & F University, Yangling, 712100, Shaanxi, China
  - Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A & F University, Yangling, 712100, Shaanxi, China
- Wei Dong
  - College of Water Resources and Architectural Engineering, Northwest A & F University, Yangling, 712100, Shaanxi, China
  - Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A & F University, Yangling, 712100, Shaanxi, China
- WenJie Luo
  - College of Water Resources and Architectural Engineering, Northwest A & F University, Yangling, 712100, Shaanxi, China
  - Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A & F University, Yangling, 712100, Shaanxi, China
- Wei Jiang
  - College of Water Resources and Architectural Engineering, Northwest A & F University, Yangling, 712100, Shaanxi, China
  - Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A & F University, Yangling, 712100, Shaanxi, China
- QuanWu Li
  - College of Water Resources and Architectural Engineering, Northwest A & F University, Yangling, 712100, Shaanxi, China
  - Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A & F University, Yangling, 712100, Shaanxi, China
- ZiLi He
  - College of Water Resources and Architectural Engineering, Northwest A & F University, Yangling, 712100, Shaanxi, China
  - Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A & F University, Yangling, 712100, Shaanxi, China
|
58
|
Giani AM, Gallo GR, Gianfranceschi L, Formenti G. Long walk to genomics: History and current approaches to genome sequencing and assembly. Comput Struct Biotechnol J 2019; 18:9-19. [PMID: 31890139 PMCID: PMC6926122 DOI: 10.1016/j.csbj.2019.11.002] [Citation(s) in RCA: 109] [Impact Index Per Article: 21.8] [Received: 08/29/2019] [Revised: 11/03/2019] [Accepted: 11/06/2019] [Indexed: 12/13/2022]
Abstract
Genomes represent the starting point of genetic studies. Since the discovery of DNA structure, scientists have devoted great efforts to determine their sequence in an exact way. In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones in genome sequencing and assembly, ranging from early sequencing methods to Next-Generation Sequencing platforms. We then focus on the advantages and challenges of the current technologies and approaches, collectively known as Third Generation Sequencing. As these technical advancements have been accompanied by progress in analytical methods, we also review the bioinformatic tools currently employed in de novo genome assembly, as well as some applications of Third Generation Sequencing technologies and high-quality reference genomes.
Key Words
- BAC, Bacterial Artificial Chromosome
- Bioinformatics
- Genome assembly
- HGP, Human Genome Project
- HMW, high molecular weight
- HapMap, haplotype map
- NGS, Next Generation Sequencing
- Next-generation
- OLC, Overlap-Layout-Consensus
- QV, Quality Value
- Reference
- SBS, Sequencing by Synthesis
- SMRT, Single Molecule Real-Time
- SNPs, Single Nucleotide Polymorphisms
- SRA, Short Read Archive
- SV, Structural Variant
- Sequencing
- TGS, Third Generation Sequencing
- Third-generation
- WGS, Whole Genome Sequencing
- ZMW, Zero-Mode Waveguide
- bp, base pair
- dNTPs, deoxynucleoside triphosphates
- ddNTP, 2′,3′-dideoxynucleoside triphosphate
Affiliation(s)
- Alice Maria Giani
  - Department of Surgery, Weill Cornell Medical College, New York, NY, USA
|
59
|
Ang MY, Low TY, Lee PY, Wan Mohamad Nazarie WF, Guryev V, Jamal R. Proteogenomics: From next-generation sequencing (NGS) and mass spectrometry-based proteomics to precision medicine. Clin Chim Acta 2019; 498:38-46. [DOI: 10.1016/j.cca.2019.08.010] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 05/22/2019] [Revised: 08/13/2019] [Accepted: 08/13/2019] [Indexed: 12/14/2022]
|
60
|
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics 2019; 34:2927-2935. [PMID: 29617936 DOI: 10.1093/bioinformatics/bty202] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/28/2017] [Accepted: 04/02/2018] [Indexed: 12/29/2022]
Abstract
Motivation: RNA virus populations contain different but genetically related strains, all infecting an individual host. Reconstruction of the viral haplotypes is a fundamental step to characterize the virus population, predict their viral phenotypes and finally provide important information for clinical treatment and prevention. Advances in next-generation sequencing technologies open up new opportunities to assemble full-length haplotypes. However, error-prone short reads, high similarities between related strains, and an unknown number of haplotypes pose computational challenges for reference-free haplotype reconstruction. There is still much room to improve the performance of existing haplotype assembly tools. Results: In this work, we developed a de novo haplotype reconstruction tool named PEHaplo, which employs paired-end reads to distinguish highly similar strains for viral quasispecies data. It was applied to both simulated and real quasispecies data, and the results were benchmarked against several recently published de novo haplotype reconstruction tools. The comparison shows that PEHaplo outperforms the benchmarked tools in a comprehensive set of metrics. Availability and implementation: The source code and the documentation of PEHaplo are available at https://github.com/chjiao/PEHaplo. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiao Chen
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Yingchao Zhao
  - School of Computing and Information Sciences, Caritas Institute of Higher Education, Hong Kong, China
- Yanni Sun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
|
61
|
Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2019; 20:1542-1559. [PMID: 29617724 PMCID: PMC6781587 DOI: 10.1093/bib/bby017] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 11/20/2017] [Revised: 02/06/2018] [Indexed: 02/06/2023]
Abstract
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.
Affiliation(s)
- Damla Senol Cali
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Jeremie S Kim
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
- Saugata Ghose
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
- Onur Mutlu
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
|
62
|
|
63
|
Koutsandreas T, Ladoukakis E, Pilalis E, Zarafeta D, Kolisis FN, Skretas G, Chatziioannou AA. ANASTASIA: An Automated Metagenomic Analysis Pipeline for Novel Enzyme Discovery Exploiting Next Generation Sequencing Data. Front Genet 2019; 10:469. [PMID: 31178894 PMCID: PMC6543708 DOI: 10.3389/fgene.2019.00469] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/01/2018] [Accepted: 05/01/2019] [Indexed: 01/27/2023]
Abstract
Metagenomic analysis of environmental samples provides deep insight into the enzymatic mixture of the corresponding niches, capable of revealing peptide sequences with novel functional properties by exploiting the high performance of next-generation sequencing (NGS) technologies. At the same time, due to their ever increasing complexity, there is a compelling need for ever larger computational configurations to ensure proper bioinformatic analysis and fine annotation. Aiming to address the challenges of such an endeavor, we have developed a novel web-based application named ANASTASIA (automated nucleotide aminoacid sequences translational plAtform for systemic interpretation and analysis). ANASTASIA provides a rich environment of bioinformatic tools, either publicly available or novel, proprietary algorithms, integrated within numerous automated algorithmic workflows, which enables versatile data processing tasks for (meta)genomic sequence datasets. ANASTASIA was initially developed in the framework of the European FP7 project HotZyme, whose aim was to perform exhaustive analysis of metagenomes derived from thermal springs around the globe and to discover new enzymes of industrial interest. ANASTASIA has evolved to become a stable and extensible environment for diversified, metagenomic, functional analyses for a range of applications from industrial biotechnology to biomedicine, within the frames of the ELIXIR-GR project. As a showcase, we report the successful in silico mining of a novel thermostable esterase termed “EstDZ4” from a metagenomic sample collected from a hot spring located in Krisuvik, Iceland.
Affiliation(s)
- Theodoros Koutsandreas
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
  - e-NIOS Applications PC, Athens, Greece
- Efthymios Ladoukakis
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
  - Laboratory of Biotechnology, School of Chemical Engineering, National Technical University of Athens, Athens, Greece
- Eleftherios Pilalis
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
  - e-NIOS Applications PC, Athens, Greece
- Dimitra Zarafeta
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
- Fragiskos N Kolisis
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
  - Laboratory of Biotechnology, School of Chemical Engineering, National Technical University of Athens, Athens, Greece
- Georgios Skretas
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
- Aristotelis A Chatziioannou
  - Institute of Chemical Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
  - e-NIOS Applications PC, Athens, Greece
|
64
|
Tian S, Yan H, Klee EW, Kalmbach M, Slager SL. Comparative analysis of de novo assemblers for variation discovery in personal genomes. Brief Bioinform 2019; 19:893-904. [PMID: 28407084 PMCID: PMC6169673 DOI: 10.1093/bib/bbx037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/12/2016] [Accepted: 03/08/2017] [Indexed: 12/30/2022]
Abstract
Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations and provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly and whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM together with three haplotype- or local de novo assembly-based callers. SGVar outperforms the other assemblers in sensitivity and tolerance of sequencing errors. We recapitulated the findings on whole-genome and exome data from a Utah residents with Northern and Western European ancestry (CEU) trio, showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar is capable of resolving long-range phase and identifying large INDELs from WES, more prominently from WGS. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Eric W Klee
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.,Center for Individualized Medicine Bioinformatics Program, Mayo Clinic, USA
| | - Michael Kalmbach
- Division of Information Management and Analytics, Department of Information Technology, Mayo Clinic, USA
| | - Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
|
65
|
Abstract
Affordable, high-throughput DNA sequencing has accelerated the pace of genome assembly over the past decade. Genome assemblies from high-throughput, short-read sequencing, however, are often not as contiguous as the first generation of genome assemblies. Whereas early genome assembly projects were often aided by clone maps or other mapping data, many current assembly projects forego these scaffolding data and only assemble genomes into smaller segments. Recently, new technologies have been invented that allow chromosome-scale assembly at a lower cost and faster speed than traditional methods. Here, we give an overview of the problem of chromosome-scale assembly and traditional methods for tackling this problem. We then review new technologies for chromosome-scale assembly and recent genome projects that used these technologies to create highly contiguous genome assemblies at low cost.
Affiliation(s)
- Edward S. Rice
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
| | - Richard E. Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Dovetail Genomics, LLC, Santa Cruz, California 95060, USA
|
66
|
Celis JS, Wibberg D, Ramírez-Portilla C, Rupp O, Sczyrba A, Winkler A, Kalinowski J, Wilke T. Binning enables efficient host genome reconstruction in cnidarian holobionts. Gigascience 2018; 7:5039706. [PMID: 29917104 PMCID: PMC6049006 DOI: 10.1093/gigascience/giy075] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/27/2018] [Accepted: 06/14/2018] [Indexed: 12/19/2022]
Abstract
Background Many cnidarians, including stony corals, engage in complex symbiotic associations, comprising the eukaryotic host, photosynthetic algae, and highly diverse microbial communities—together referred to as holobiont. This taxonomic complexity makes sequencing and assembling coral host genomes extremely challenging. Therefore, previous cnidarian genomic projects were based on symbiont-free tissue samples. However, this approach may not be applicable to the majority of cnidarian species for ecological reasons. We therefore evaluated the performance of an alternative method based on sequence binning for reconstructing the genome of the stony coral Porites rus from a hologenomic sample and compared it to traditional approaches. Results Our results demonstrate that binning performs well for hologenomic data, producing sufficient reads for assembling the draft genome of P. rus. An assembly evaluation based on operational criteria showed results that were comparable to symbiont-free approaches in terms of completeness and usefulness, despite a high degree of fragmentation in our assembly. In addition, we found that binning provides sufficient data for exploratory k-mer estimation of genomic features, such as genome size and heterozygosity. Conclusions Binning constitutes a powerful approach for disentangling taxonomically complex coral hologenomes. Considering the recent decline of coral reefs on the one hand and previous limitations to coral genome sequencing on the other hand, binning may facilitate rapid and reliable genome assembly. This study also provides an important milestone in advancing binning from the metagenomic to the hologenomic and from the prokaryotic to the eukaryotic level.
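The exploratory k-mer estimation mentioned in this abstract can be illustrated with a common heuristic: genome size is approximated as the total number of k-mer occurrences divided by the depth at the main peak of the k-mer multiplicity histogram. The sketch below is a minimal Python illustration under assumptions that are not taken from the paper: the function name `estimate_genome_size`, the `min_depth` cutoff for discarding likely error k-mers, and the toy histogram are all hypothetical, and this is not necessarily the procedure the authors used.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer multiplicity histogram.

    histogram maps multiplicity (how often a k-mer occurs in the reads)
    to the number of distinct k-mers seen with that multiplicity.
    min_depth is a hypothetical cutoff: k-mers seen fewer times are
    treated as sequencing errors and ignored.
    """
    usable = {depth: count for depth, count in histogram.items()
              if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak approximates the average k-mer coverage depth.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences contributed by the retained k-mers.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (invented numbers): a large low-depth error tail
# plus a single coverage peak around depth 20.
toy_histogram = {1: 9000, 2: 1200, 18: 200, 19: 500, 20: 800, 21: 500, 22: 200}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In a hologenomic sample the histogram would be built only from reads assigned to the host bin, since symbiont and microbial reads would otherwise add their own peaks.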
Affiliation(s)
- Juan Sebastián Celis
- Animal Ecology and Systematics, Justus Liebig University Giessen, Heinrich-Buff-Ring 26-32 (IFZ), 35392 Giessen, Germany; Corporation Center of Excellence in Marine Sciences, Cra 54 No 106-18, Bogotá, Colombia
| | - Daniel Wibberg
- Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
| | - Catalina Ramírez-Portilla
- Animal Ecology and Systematics, Justus Liebig University Giessen, Heinrich-Buff-Ring 26-32 (IFZ), 35392 Giessen, Germany; Evolutionary Biology and Ecology, Université libre de Bruxelles, Av. Franklin D. Roosevelt 50, CP 160/12, B-1050 Brussels, Belgium
| | - Oliver Rupp
- Bioinformatics and Systems Biology, Justus Liebig University Giessen, Heinrich-Buff-Ring 58, 35392 Giessen, Germany
| | - Alexander Sczyrba
- Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
| | - Anika Winkler
- Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
| | - Jörn Kalinowski
- Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
| | - Thomas Wilke
- Animal Ecology and Systematics, Justus Liebig University Giessen, Heinrich-Buff-Ring 26-32 (IFZ), 35392 Giessen, Germany; Corporation Center of Excellence in Marine Sciences, Cra 54 No 106-18, Bogotá, Colombia
|
67
|
Richards DJ, Renaud L, Agarwal N, Starr Hazard E, Hyde J, Hardiman G. De Novo Hepatic Transcriptome Assembly and Systems Level Analysis of Three Species of Dietary Fish, Sardinops sagax, Scomber japonicus, and Pleuronichthys verticalis. Genes (Basel) 2018; 9:genes9110521. [PMID: 30366465 PMCID: PMC6266404 DOI: 10.3390/genes9110521] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/04/2018] [Accepted: 10/17/2018] [Indexed: 12/31/2022]
Abstract
The monitoring of marine species as sentinels for ecosystem health has long been a valuable tool worldwide, providing insight into how both anthropogenic pollution and naturally occurring phenomena (e.g., harmful algal blooms) may lead to human and animal dietary concerns. The marine environments contain many contaminants of anthropogenic origin that have sufficient similarities to steroid and thyroid hormones to potentially disrupt normal endocrine physiology in humans, fish, and other animals. An appropriate understanding of the effects of these endocrine disrupting chemicals (EDCs) on forage fish (e.g., sardine, anchovy, mackerel) can lead to significant insight into how these contaminants may affect local ecosystems in addition to their potential impacts on human health. With advancements in molecular tools (e.g., high-throughput sequencing, HTS), a genomics approach offers a robust toolkit to discover putative genetic biomarkers in fish exposed to these chemicals. However, the lack of available sequence information for non-model species has limited the development of these genomic toolkits. Using HTS and de novo assembly technology, the present study aimed to establish, for the first time for Sardinops sagax (Pacific sardine), Scomber japonicus (Pacific chub mackerel) and Pleuronichthys verticalis (hornyhead turbot), a de novo global transcriptome database of the liver, the primary organ involved in detoxification. The assembled transcriptomes provide a foundation for further downstream validation, comparative genomic analysis and biomarker development for future applications in ecotoxicogenomic studies, as well as environmental evaluation (e.g., climate change) and public health safety (e.g., dietary screening).
Affiliation(s)
- Dylan J Richards
- Bioengineering Department, Clemson University, Charleston, SC 29425, USA.
| | - Ludivine Renaud
- Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
- Center for Genomic Medicine, Bioinformatics, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - Nisha Agarwal
- Biomedical Informatics Research Center, San Diego State University, San Diego, CA 92182, USA.
| | - E Starr Hazard
- Center for Genomic Medicine, Bioinformatics, Medical University of South Carolina, Charleston, SC 29425, USA.
- Academic Affairs Faculty & Computational Biology Resource Center, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - John Hyde
- NOAA Fisheries, Southwest Fisheries Science Center, La Jolla, CA 92037, USA.
| | - Gary Hardiman
- Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
- Center for Genomic Medicine, Bioinformatics, Medical University of South Carolina, Charleston, SC 29425, USA.
- Biomedical Informatics Research Center, San Diego State University, San Diego, CA 92182, USA.
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
- Laboratory for Marine Systems Biology, Hollings Marine Laboratory, Charleston, SC 29412, USA.
- School of Biological Sciences & Institute for Global Food Security, Queens University Belfast, Stranmillis Road, Belfast BT9 5AG, UK.
|
68
|
Aspeling-Jones H, Conway DJ. An expanded global inventory of allelic variation in the most extremely polymorphic region of Plasmodium falciparum merozoite surface protein 1 provided by short read sequence data. Malar J 2018; 17:345. [PMID: 30285849 PMCID: PMC6167803 DOI: 10.1186/s12936-018-2475-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/22/2017] [Accepted: 09/04/2018] [Indexed: 12/15/2022]
Abstract
Background Within Plasmodium falciparum merozoite surface protein 1 (MSP1), the N-terminal block 2 region is a highly polymorphic target of naturally acquired antibody responses. The antigenic diversity is determined by complex repeat sequences as well as non-repeat sequences, grouping into three major allelic types that appear to be maintained within populations by natural selection. Within these major types, many distinct allelic sequences have been described in different studies, but the extent and significance of the diversity remains unresolved. Methods To survey the diversity more extensively, block 2 allelic sequences in the msp1 gene were characterized in 2400 P. falciparum infection isolates with whole genome short read sequence data available from the Pf3K project, and compared with the data from previous studies. Results Mapping the short read sequence data in the 2400 isolates to a reference library of msp1 block 2 allelic sequences yielded 3815 allele scores at the level of major allelic family types, with 46% of isolates containing two or more of these major types. Overall frequencies were similar to those previously reported in other samples with different methods, the K1-like allelic type being most common in Africa, MAD20-like most common in Southeast Asia, and RO33-like being the third most abundant type in each continent. The rare MR type, formed by recombination between MAD20-like and RO33-like alleles, was only seen in Africa and very rarely in the Indian subcontinent but not in Southeast Asia. A combination of mapped short read assembly approaches enabled 1522 complete msp1 block 2 sequences to be determined, among which there were 363 different allele sequences, of which 246 have not been described previously. In these data, the K1-like msp1 block 2 alleles are most diverse and encode 225 distinct amino acid sequences, compared with 123 different MAD20-like, 9 RO33-like and 6 MR type sequences. 
Within each of the major types, the different allelic sequences show highly skewed geographical distributions, with most of the more common sequences being detected in either Africa or Asia, but not in both. Conclusions Allelic sequences of this extremely polymorphic locus have been derived from whole genome short read sequence data by mapping to a reference library followed by assembly of mapped reads. The catalogue of sequence variation has been greatly expanded, so that there are now more than 500 different msp1 block 2 allelic sequences described. This provides an extensive reference for molecular epidemiological genotyping and sequencing studies, and potentially for design of a multi-allelic vaccine. Electronic supplementary material The online version of this article (10.1186/s12936-018-2475-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Harvey Aspeling-Jones
- Pathogen Molecular Biology Department, London School of Hygiene and Tropical Medicine, Keppel St, London, WC1E 7HT, UK.
| | - David J Conway
- Pathogen Molecular Biology Department, London School of Hygiene and Tropical Medicine, Keppel St, London, WC1E 7HT, UK.
|
69
|
Yoon S, Kim D, Kang K, Park WJ. TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix. BMC Genomics 2018; 19:653. [PMID: 30180798 PMCID: PMC6123912 DOI: 10.1186/s12864-018-5034-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/10/2017] [Accepted: 08/23/2018] [Indexed: 01/15/2023]
Abstract
BACKGROUND The challenges when developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers utilize a de Bruijn graph, with which complexity grows linearly with data size while suffering from errors and repeats. Although one can correct the errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. Two research directions are to improve either the graph reliability or the path search precision, and in this study, we focused on the former. RESULTS We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. In the proposed approach, we built contigs by direct read alignment within a distance margin and performed a junction search to construct splicing graphs. While doing so, a contig of length l was represented by a 4 × l matrix (called a consensus matrix), in which each element was the base count of the aligned reads so far. A representative sequence was obtained by taking the majority in each column of the consensus matrix to be used for further read alignment. Once the splicing graphs had been obtained, we used IsoLasso to find paths with a noticeable read depth. The experiments using real and simulated reads show that the method provided considerable improvement in sensitivity and moderately better performance when comparing sensitivity and precision. This was achieved by the error-aware graph construction using the consensus matrix, with which the reads having errors were made usable for the graph construction (otherwise, they might have been eventually discarded). This improved the quality of the coverage depth information used in the subsequent path search step and finally the reliability of the graph. CONCLUSIONS De novo assembly is mainly used to explore undiscovered isoforms and must be able to represent as many reads as possible in an efficient way. 
In this sense, TraRECo provides us with a potential alternative for improving graph reliability even though the computational burden is much higher than the single k-mer in the de Bruijn graph approach.
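The consensus matrix described in this abstract (a 4 × l matrix of per-position base counts, with the representative sequence taken as the column-wise majority) can be sketched in a few lines of Python. This is an illustrative toy and not the TraRECo implementation: the function names, the `(offset, sequence)` read representation, the tie-breaking rule and the example reads are all assumptions made for the sketch, and read alignment itself is taken as already done.

```python
BASES = "ACGT"  # row order of the consensus matrix


def build_consensus_matrix(aligned_reads, length):
    """Return a 4 x length matrix of base counts.

    aligned_reads is an iterable of (offset, sequence) pairs, where
    offset is the assumed 0-based start of the read on the contig.
    """
    matrix = [[0] * length for _ in BASES]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < length and base in BASES:
                matrix[BASES.index(base)][pos] += 1
    return matrix


def representative_sequence(matrix):
    """Take the majority base in each column; 'N' where no read covers.

    Ties are broken in A, C, G, T order (an arbitrary choice for the sketch).
    """
    result = []
    for col in range(len(matrix[0])):
        counts = [matrix[row][col] for row in range(len(BASES))]
        if sum(counts) == 0:
            result.append("N")
        else:
            result.append(BASES[counts.index(max(counts))])
    return "".join(result)


# Four toy reads; the third carries an error (A instead of T) at position 3.
reads = [(0, "ACGTAC"), (2, "GTACGG"), (1, "CGAACG"), (0, "ACGT")]
matrix = build_consensus_matrix(reads, 8)
consensus = representative_sequence(matrix)
print(consensus)
```

The erroneous read still contributes its correct bases to the counts at the other positions, which matches the abstract's point that reads with errors remain usable for graph construction instead of being discarded.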
Affiliation(s)
- Seokhyun Yoon
- Department of Electronics Eng., College of Engineering, Dankook University, Yongin-si, Korea
| | - Daeseung Kim
- Department of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea
| | - Keunsoo Kang
- Department of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea.
| | - Woong June Park
- Department of Molecular Biology, College of Natural Sciences, Dankook University, Cheonan-si, Korea
|
70
|
Rando HM, Farré M, Robson MP, Won NB, Johnson JL, Buch R, Bastounes ER, Xiang X, Feng S, Liu S, Xiong Z, Kim J, Zhang G, Trut LN, Larkin DM, Kukekova AV. Construction of Red Fox Chromosomal Fragments from the Short-Read Genome Assembly. Genes (Basel) 2018; 9:E308. [PMID: 29925783 PMCID: PMC6027122 DOI: 10.3390/genes9060308] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/17/2018] [Revised: 05/19/2018] [Accepted: 06/04/2018] [Indexed: 01/08/2023]
Abstract
The genome of a red fox (Vulpes vulpes) was recently sequenced and assembled using next-generation sequencing (NGS). The assembly is of high quality, with 94X coverage and a scaffold N50 of 11.8 Mbp, but is split into 676,878 scaffolds, some of which are likely to contain assembly errors. Fragmentation and misassembly hinder accurate gene prediction and downstream analysis such as the identification of loci under selection. Therefore, assembly of the genome into chromosome-scale fragments was an important step towards developing this genomic model. Scaffolds from the assembly were aligned to the dog reference genome and compared to the alignment of an outgroup genome (cat) against the dog to identify syntenic sequences among species. The program Reference-Assisted Chromosome Assembly (RACA) then integrated the comparative alignment with the mapping of the raw sequencing reads generated during assembly against the fox scaffolds. The 128 sequence fragments RACA assembled were compared to the fox meiotic linkage map to guide the construction of 40 chromosomal fragments. This computational approach to assembly was facilitated by prior research in comparative mammalian genomics, and the continued improvement of the red fox genome can in turn offer insight into canid and carnivore chromosome evolution. This assembly is also necessary for advancing genetic research in foxes and other canids.
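The abstract reports a high scaffold N50 alongside a very large scaffold count. The standard N50 definition explains how both can hold: N50 is the length of the scaffold at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly size, so a long tail of tiny scaffolds barely affects it. The sketch below is a minimal Python illustration; the function name `n50` and the toy lengths are invented for the example and are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; N50 is the length
    of the scaffold at which the running total first reaches half of
    the summed length of all scaffolds.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 is undefined for an empty collection")


# Toy assembly: three long scaffolds plus twenty very short ones.
toy_lengths = [50, 40, 30] + [1] * 20
print(len(toy_lengths), "scaffolds, N50 =", n50(toy_lengths))
```

In the toy example 23 scaffolds give an N50 of 40 because the two longest scaffolds already cover more than half of the total length, which is the same effect, at a much smaller scale, as a fragmented assembly with a multi-megabase N50.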
Affiliation(s)
- Halie M Rando
- Illinois Informatics Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
- Department of Animal Science, College of Agricultural, Consumer and Environmental Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | - Marta Farré
- Department of Comparative Biomedical Science, Royal Veterinary College, London NW1 0TU, UK.
| | - Michael P Robson
- Department of Computer Science, College of Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | - Naomi B Won
- Department of Animal Science, College of Agricultural, Consumer and Environmental Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | - Jennifer L Johnson
- Department of Animal Science, College of Agricultural, Consumer and Environmental Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | - Ronak Buch
- Department of Computer Science, College of Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | - Estelle R Bastounes
- Department of Animal Science, College of Agricultural, Consumer and Environmental Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | - Xueyan Xiang
- China National Genebank, BGI -Shenzhen, Shenzhen 518083, Guangdong, China.
| | - Shaohong Feng
- China National Genebank, BGI -Shenzhen, Shenzhen 518083, Guangdong, China.
| | - Shiping Liu
- China National Genebank, BGI -Shenzhen, Shenzhen 518083, Guangdong, China.
| | - Zijun Xiong
- China National Genebank, BGI -Shenzhen, Shenzhen 518083, Guangdong, China.
| | - Jaebum Kim
- Department of Stem Cell and Regenerative Biology, Konkuk University, Seoul 05029, Korea.
| | - Guojie Zhang
- China National Genebank, BGI -Shenzhen, Shenzhen 518083, Guangdong, China.
- Section for Ecology and Evolution, Department of Biology, Universitetsparken 15, University of Copenhagen, DK-2100 Copenhagen, Denmark.
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China.
| | - Lyudmila N Trut
- Institute of Cytology and Genetics of the Russian Academy of Sciences, Novosibirsk 630090, Russia.
| | - Denis M Larkin
- Department of Comparative Biomedical Science, Royal Veterinary College, London NW1 0TU, UK.
| | - Anna V Kukekova
- Department of Animal Science, College of Agricultural, Consumer and Environmental Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
|
71
|
Bengtsson-Palme J, Larsson DGJ, Kristiansson E. Using metagenomics to investigate human and environmental resistomes. J Antimicrob Chemother 2018; 72:2690-2703. [PMID: 28673041 DOI: 10.1093/jac/dkx199] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Indexed: 12/12/2022]
Abstract
Antibiotic resistance is a global health concern declared by the WHO as one of the largest threats to modern healthcare. In recent years, metagenomic DNA sequencing has started to be applied as a tool to study antibiotic resistance in different environments, including the human microbiota. However, a multitude of methods exist for metagenomic data analysis, and not all methods are suitable for the investigation of resistance genes, particularly if the desired outcome is an assessment of risks to human health. In this review, we outline the current state of methods for sequence handling, mapping to databases of resistance genes, statistical analysis and metagenomic assembly. In addition, we provide an overview of important considerations related to the analysis of resistance genes, and recommend some of the currently used tools and methods that are best equipped to inform research and clinical practice related to antibiotic resistance.
Affiliation(s)
- Johan Bengtsson-Palme
- Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Guldhedsgatan 10, SE-41346, Gothenburg, Sweden; Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden
| | - D G Joakim Larsson
- Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Guldhedsgatan 10, SE-41346, Gothenburg, Sweden; Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden
| | - Erik Kristiansson
- Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden; Department of Mathematical Sciences, Chalmers University of Technology, SE-41296, Gothenburg, Sweden
|
72
|
Nerli S, McShan AC, Sgourakis NG. Chemical shift-based methods in NMR structure determination. PROGRESS IN NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY 2018; 106-107:1-25. [PMID: 31047599 PMCID: PMC6788782 DOI: 10.1016/j.pnmrs.2018.03.002] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 01/25/2018] [Revised: 03/09/2018] [Accepted: 03/09/2018] [Indexed: 05/08/2023]
Abstract
Chemical shifts are highly sensitive probes harnessed by NMR spectroscopists and structural biologists as conformational parameters to characterize a range of biological molecules. Traditionally, assignment of chemical shifts has been a labor-intensive process requiring numerous samples and a suite of multidimensional experiments. Over the past two decades, the development of complementary computational approaches has bolstered the analysis, interpretation and utilization of chemical shifts for elucidation of high resolution protein and nucleic acid structures. Here, we review the development and application of chemical shift-based methods for structure determination with a focus on ab initio fragment assembly, comparative modeling, oligomeric systems, and automated assignment methods. Throughout our discussion, we point out practical uses, as well as advantages and caveats, of using chemical shifts in structure modeling. We additionally highlight (i) hybrid methods that employ chemical shifts with other types of NMR restraints (residual dipolar couplings, paramagnetic relaxation enhancements and pseudocontact shifts) that allow for improved accuracy and resolution of generated 3D structures, (ii) the utilization of chemical shifts to model the structures of sparsely populated excited states, and (iii) modeling of sidechain conformations. Finally, we briefly discuss the advantages of contemporary methods that employ sparse NMR data recorded using site-specific isotope labeling schemes for chemical shift-driven structure determination of larger molecules. With this review, we aim to emphasize the accessibility and versatility of chemical shifts for structure determination of challenging biological systems, and to point out emerging areas of development that lead us towards the next generation of tools.
Affiliation(s)
- Santrupti Nerli
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States; Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA 95064, United States
| | - Andrew C McShan
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States
| | - Nikolaos G Sgourakis
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States.
|
73
|
Rahn R, Budach S, Costanza P, Ehrhardt M, Hancox J, Reinert K. Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading. Bioinformatics 2018; 34:3437-3445. [DOI: 10.1093/bioinformatics/bty380] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 01/23/2018] [Accepted: 05/02/2018] [Indexed: 11/13/2022]
Affiliation(s)
- René Rahn
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | - Stefan Budach
- Otto-Warburg-Laboratory, RNA Bioinformatics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | | | - Marcel Ehrhardt
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | - Jonny Hancox
- Health & Life Sciences, Intel Corporation, London, UK
| | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Otto-Warburg-Laboratory, RNA Bioinformatics, Max Planck Institute for Molecular Genetics, Berlin, Germany
|
74
|
Khan AR, Pervez MT, Babar ME, Naveed N, Shoaib M. A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective. Evol Bioinform Online 2018; 14:1176934318758650. [PMID: 29511353 PMCID: PMC5826002 DOI: 10.1177/1176934318758650] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 08/02/2017] [Accepted: 01/19/2018] [Indexed: 12/21/2022]
Abstract
BACKGROUND Current advancements in next-generation sequencing technology have made it possible to sequence whole genomes, but assembling a large number of short sequence reads is still a big challenge. In this article, we present a comparative study of seven assemblers, namely, ABySS, Velvet, Edena, SGA, Ray, SSAKE, and Perga, using prokaryotic and eukaryotic paired-end as well as single-end data sets from the Illumina platform. RESULTS For single-end data sets, Velvet and ABySS outperformed the other assemblers, with comparatively low assembly time and high genome fraction. Velvet consumed less memory than any other assembler. For paired-end data sets, Velvet took the least time and produced a high genome fraction, after ABySS and Ray. In terms of low memory usage, SGA and Edena outperformed all the other assemblers. Ray also showed good genome fraction; however, its extremely high assembly time might make it prohibitively slow on larger single-end and paired-end data sets. CONCLUSIONS Our comparison will help scientists select a suitable assembler for their data sets and will also help developers upgrade existing assemblers or develop new ones for de novo assembly.
Affiliation(s)
- Abdul Rafay Khan
- Department of Bioinformatics and Computational Biology, Virtual University of Pakistan, Lahore, Pakistan
| | - Muhammad Tariq Pervez
- Department of Bioinformatics and Computational Biology, Virtual University of Pakistan, Lahore, Pakistan
| | | | - Nasir Naveed
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
| | - Muhammad Shoaib
- Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
|
75
|
Abstract
In the absence of a reference genome, the ultimate goal of a de novo transcriptome assembly is to accurately and comprehensively reconstruct the set of messenger RNA transcripts represented in the sample. Non-reference assembly of the transcriptome of polyploid species poses a particular challenge because of the presence of homeologs that are difficult to disentangle at the sequence level. This is especially true for hexaploid oats, which have three highly similar subgenomes, two of which are thought to be nearly identical. Under these circumstances, most software packages and established pipelines encounter difficulties in rendering an accurate transcriptome because they are typically developed, refined, and tested for diploid organisms. We present a protocol for transcriptome assembly in oats that can be extended both to other polyploids and species with highly duplicated genomes.
|
76
|
Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, Amselem J, Bouri L, Bocs S, Klopp C, Gibrat JF, Vlasova A, Leskosek BL, Soler L, Binzer-Panchal M, Lantz H. Ten steps to get started in Genome Assembly and Annotation. F1000Res 2018; 7. [PMID: 29568489 PMCID: PMC5850084 DOI: 10.12688/f1000research.13598.1] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Accepted: 01/19/2018] [Indexed: 12/16/2022]
Abstract
As a part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to facilitate researchers getting started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project. Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR).
Affiliation(s)
| | - Erik Hjerde
- Department of Chemistry, Norstruct, UiT The Arctic University of Norway, Tromsø, 9019, Norway
| | - Lieven Sterck
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium; VIB-UGent Center for Plant Systems Biology, Ghent University - VIB, Technologiepark 927, 9052 Ghent, Belgium
| | - Salvador Capella-Gutierrez
- Spanish National Bioinformatics Institute (INB), Barcelona, Spain; Barcelona Supercomputing Center (BSC), Centro Nacional de Supercomputación, Barcelona, Spain
| | - Cedric Notredame
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Olga Vinnere Pettersson
- Uppsala Genome Center, NGI/SciLifeLab, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-752 37, Sweden
| | - Joelle Amselem
- URGI, INRA, Université Paris-Saclay, Versailles, 78026, France
| | - Laurent Bouri
- Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France
| | - Stephanie Bocs
- CIRAD, UMR AGAP, Montpellier, 34398, France.,AGAP, Cirad, INRA, Montpellier SupAgro, Universite Montpellier, Montpellier, France.,South Green Bioinformatics Platform, Montpellier, France
| | | | - Jean-Francois Gibrat
- Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France.,Unité de recherche , INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
| | - Anna Vlasova
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Brane L Leskosek
- Faculty of Medicine, Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
| | - Lucile Soler
- IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
| | | | - Henrik Lantz
- IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
| |
Collapse
|
77
|
Automated NMR resonance assignments and structure determination using a minimal set of 4D spectra. Nat Commun 2018; 9:384. [PMID: 29374165 PMCID: PMC5786013 DOI: 10.1038/s41467-017-02592-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 08/20/2017] [Accepted: 12/12/2017] [Indexed: 12/22/2022]
Abstract
Automated methods for NMR structure determination of proteins are continuously becoming more robust. However, current methods addressing larger, more complex targets rely on analyzing 6-10 complementary spectra, suggesting the need for alternative approaches. Here, we describe 4D-CHAINS/autoNOE-Rosetta, a complete pipeline for NOE-driven structure determination of medium- to larger-sized proteins. The 4D-CHAINS algorithm analyzes two 4D spectra recorded using a single, fully protonated protein sample in an iterative ansatz where common NOEs between different spin systems supplement conventional through-bond connectivities to establish assignments of sidechain and backbone resonances at high levels of completeness and with a minimum error rate. The 4D-CHAINS assignments are then used to guide automated assignment of long-range NOEs and structure refinement in autoNOE-Rosetta. Our results on four targets ranging in size from 15.5 to 27.3 kDa illustrate that the structures of proteins can be determined accurately and in an unsupervised manner in a matter of days.
|
78
|
Acuña-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics 2018; 19:54. [PMID: 29338683 PMCID: PMC5771137 DOI: 10.1186/s12864-017-4429-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 09/26/2017] [Accepted: 12/29/2017] [Indexed: 12/15/2022]
Abstract
BACKGROUND Without knowledge of their genomic sequences, it is impossible to make functional models of the bacteria that make up human and animal microbiota. Unfortunately, the vast majority of publicly available genomes are only working drafts, an incompleteness that causes numerous problems and constitutes a major obstacle to genotypic and phenotypic interpretation. In this work, we began with an example from the class Bacteroidia in the phylum Bacteroidetes, which is preponderant among human orodigestive microbiota. We successfully identify the genetic loci responsible for assembly breaks and misassemblies and demonstrate the importance and usefulness of long-read sequencing and curated reannotation. RESULTS We showed that the fragmentation in Bacteroidia draft genomes assembled from massively parallel sequencing linearly correlates with genomic repeats of the same or greater size than the reads. We also demonstrated that some of these repeats, especially the long ones, correspond to misassembled loci in three reference Porphyromonas gingivalis genomes marked as circularized (thus complete or finished). We prove that even at modest coverage (30X), long-read resequencing together with PCR contiguity verification (rrn operons and an integrative and conjugative element or ICE) can be used to identify and correct the wrongly combined or assembled regions. Finally, although time-consuming and labor-intensive, consistent manual biocuration of three P. gingivalis strains allowed us to compare and correct the existing genomic annotations, resulting in a more accurate interpretation of the genomic differences among these strains. CONCLUSIONS In this study, we demonstrate the usefulness and importance of long-read sequencing in verifying published genomes (even when complete) and generating assemblies for new bacterial strains/species with high genomic plasticity. 
We also show that when combined with biological validation processes and diligent biocurated annotation, this strategy helps reduce the propagation of errors in shared databases, thus limiting false conclusions based on incomplete or misleading information.
Affiliation(s)
- Luis Acuña-Amador
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France; Laboratorio de Investigación en Bacteriología Anaerobia, Centro de Investigación en Enfermedades Tropicales, Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
- Aline Primot
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
- Edouard Cadieu
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
- Alain Roulet
  - GenoToul Genome & Transcriptome (GeT-PlaGe), INRA, US1426, Castanet-Tolosan, France
- Frédérique Barloy-Hubler
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
|
79
|
Advances in Sequencing and Resequencing in Crop Plants. ADVANCES IN BIOCHEMICAL ENGINEERING/BIOTECHNOLOGY 2018. [PMID: 29516115 DOI: 10.1007/10_2017_46] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022]
Abstract
DNA sequencing technologies have changed the face of biological research over the last 20 years. From reference genomes to population level resequencing studies, these technologies have made significant contributions to our understanding of plant biology and evolution. As the technologies have increased in power, the breadth and complexity of the questions that can be asked have increased. Along with this, the challenges of managing unprecedented quantities of sequence data are mounting. This chapter describes a few aspects of the journey so far and looks forward to what may lie ahead.
|
80
|
Pool deconvolution approach for high-throughput gene mining from Bacillus thuringiensis. Appl Microbiol Biotechnol 2017; 102:1467-1482. [DOI: 10.1007/s00253-017-8633-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/29/2017] [Revised: 10/24/2017] [Accepted: 11/05/2017] [Indexed: 11/27/2022]
|
81
|
Luo R, Sedlazeck FJ, Darby CA, Kelly SM, Schatz MC. LRSim: A Linked-Reads Simulator Generating Insights for Better Genome Partitioning. Comput Struct Biotechnol J 2017; 15:478-484. [PMID: 29213995 PMCID: PMC5711661 DOI: 10.1016/j.csbj.2017.10.002] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 08/24/2017] [Revised: 10/25/2017] [Accepted: 10/31/2017] [Indexed: 01/27/2023]
Abstract
Linked-read sequencing, using highly-multiplexed genome partitioning and barcoding, can span hundreds of kilobases to improve de novo assembly, haplotype phasing, and other applications. Based on our analysis of 14 datasets, we introduce LRSim, which simulates linked-reads by emulating the library preparation and sequencing process with fine control over variants, linked-read characteristics, and the short-read profile. From the phasing and assembly of multiple datasets, we derive recommendations on coverage, fragment length, and partitioning when sequencing genomes of different sizes and complexities. These optimizations improve results by orders of magnitude, and enable the development of novel methods. LRSim is available at https://github.com/aquaskyline/LRSIM.
Affiliation(s)
- Ruibang Luo
  - Department of Computer Science, Johns Hopkins University, United States; Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, United States
- Fritz J Sedlazeck
  - Department of Computer Science, Johns Hopkins University, United States
- Charlotte A Darby
  - Department of Computer Science, Johns Hopkins University, United States
- Stephen M Kelly
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, United States
- Michael C Schatz
  - Department of Computer Science, Johns Hopkins University, United States; Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, United States; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
|
82
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 259] [Impact Index Per Article: 37.0] [Indexed: 01/30/2023]
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what potential they hold for their own research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
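To make the idea of alignment-free comparison concrete, the following is a minimal sketch of one common family of such methods: comparing k-mer (word) frequency profiles. It is not taken from the cited review; the function names, the choice of k = 3, and the use of cosine distance are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(seq_a, seq_b, k=3):
    """Cosine distance between the k-mer count vectors of two sequences.

    Hypothetical helper for illustration: 0.0 means identical k-mer
    profiles, 1.0 means the sequences share no k-mers at all.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    if not counts_a or not counts_b:
        raise ValueError("both sequences must be at least k characters long")
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)
```

No alignment is computed at any point: each sequence is reduced to a count vector, so the cost grows linearly with sequence length rather than with the product of the two lengths.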
Affiliation(s)
- Andrzej Zielezinski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
- Susana Vinga
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas Almeida
  - Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
- Wojciech M Karlowski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
|
83
|
White DJ, Wang J, Hall RJ. Assessing the Impact of Assemblers on Virus Detection in a De Novo Metagenomic Analysis Pipeline. J Comput Biol 2017; 24:874-881. [PMID: 28414526 PMCID: PMC5610382 DOI: 10.1089/cmb.2017.0008] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022]
Abstract
Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, recently, we found that striking differences in results can be achieved when different assemblers were used. In this study, we test formally the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.
Affiliation(s)
- Jing Wang
  - Institute of Environmental Science and Research at the National Centre for Biosecurity and Infectious Disease, Upper Hutt, New Zealand
- Richard J. Hall
  - Animal Health Laboratory, Investigation and Diagnostic Centres and Response, Ministry for Primary Industries—Manatū Ahu Matua, Upper Hutt, New Zealand
|
84
|
Jünemann S, Kleinbölting N, Jaenicke S, Henke C, Hassa J, Nelkner J, Stolze Y, Albaum SP, Schlüter A, Goesmann A, Sczyrba A, Stoye J. Bioinformatics for NGS-based metagenomics and the application to biogas research. J Biotechnol 2017; 261:10-23. [PMID: 28823476 DOI: 10.1016/j.jbiotec.2017.08.012] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 03/20/2017] [Revised: 08/08/2017] [Accepted: 08/09/2017] [Indexed: 12/19/2022]
Abstract
Metagenomics has proven to be one of the most important research fields for microbial ecology during the last decade. From 16S rRNA marker gene analysis for the characterization of community compositions to whole metagenome shotgun sequencing, which additionally allows for functional analysis, metagenomics has been applied in a wide spectrum of research areas. The cost reduction paired with the increase in the amount of data due to the advent of next-generation sequencing led to a rapidly growing demand for bioinformatic software in metagenomics. By now, a large number of tools that can be used to analyze metagenomic datasets has been developed. The Bielefeld-Gießen center for microbial bioinformatics, as part of the German Network for Bioinformatics Infrastructure, bundles and imparts expert knowledge in the analysis of metagenomic datasets, especially in research on microbial communities involved in anaerobic digestion residing in biogas reactors. In this review, we give an overview of the field of metagenomics and introduce important bioinformatic tools and possible workflows, accompanied by application examples of biogas surveys successfully conducted at the Center for Biotechnology of Bielefeld University.
Affiliation(s)
- Sebastian Jünemann
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Nils Kleinbölting
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Sebastian Jaenicke
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Bioinformatics and Systems Biology, Justus-Liebig-Universität, Gießen, Germany
- Christian Henke
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Julia Hassa
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Johanna Nelkner
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Yvonne Stolze
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Stefan P Albaum
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Andreas Schlüter
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Alexander Goesmann
  - Bioinformatics and Systems Biology, Justus-Liebig-Universität, Gießen, Germany
- Alexander Sczyrba
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Jens Stoye
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Faculty of Technology, Bielefeld University, Bielefeld, Germany
|
85
|
Shi W, Ji P, Zhao F. The combination of direct and paired link graphs can boost repetitive genome assembly. Nucleic Acids Res 2017; 45:e43. [PMID: 27924003 PMCID: PMC5399794 DOI: 10.1093/nar/gkw1191] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/08/2016] [Accepted: 11/17/2016] [Indexed: 11/14/2022]
Abstract
Currently, most paired link based scaffolding algorithms intrinsically mask the sequences between two linked contigs and bypass their direct link information embedded in the original de Bruijn assembly graph. This disadvantage substantially complicates the scaffolding process and makes it impossible to resolve repetitive contig assembly. Here we present a novel algorithm, inGAP-sf, for effectively generating high-quality and continuous scaffolds. inGAP-sf achieves this by using a new strategy based on the combination of direct link and paired link graphs, in which the direct link is used to increase graph connectivity and to decrease graph complexity, and the paired link is employed to supervise the traversing process on the direct link graph. This advantage greatly facilitates the assembly of short-repeat enriched regions. Moreover, a new comprehensive decision model is developed to eliminate the noise routes that accompany the introduced direct link. Through extensive evaluations on both simulated and real datasets, we demonstrate that inGAP-sf outperforms most genome scaffolding algorithms by generating more accurate and continuous assemblies, especially for short repetitive regions.
Affiliation(s)
- Wenyu Shi
  - Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
- Peifeng Ji
  - Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
- Fangqing Zhao
  - Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
|
86
|
Zadesenets KS, Ershov NI, Rubtsov NB. Whole-genome sequencing of eukaryotes: From sequencing of DNA fragments to a genome assembly. RUSS J GENET+ 2017. [DOI: 10.1134/s102279541705012x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/23/2022]
|
87
|
Baichoo S, Ouzounis CA. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment. Biosystems 2017; 156-157:72-85. [PMID: 28392341 DOI: 10.1016/j.biosystems.2017.03.003] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/07/2017] [Revised: 03/21/2017] [Accepted: 03/22/2017] [Indexed: 12/12/2022]
Abstract
A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality.
Affiliation(s)
- Shakuntala Baichoo
  - Department of Computer Science & Engineering, University of Mauritius, Réduit 80837, Mauritius
- Christos A Ouzounis
  - Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica 57001, Greece
|
88
|
Abstract
Most reconstruction methods for genomes of ancient origin that are used today require a closely related reference. In order to identify genomic rearrangements or the deletion of whole genes, de novo assembly has to be used. However, because of inherent problems with ancient DNA, its de novo assembly is highly complicated. In order to tackle the diversity in the length of the input reads, we propose a two-layer approach, where multiple assemblies are generated in the first layer, which are then combined in the second layer. We used this two-layer assembly to generate assemblies for two different ancient samples and compared the results to current de novo assembly approaches. We are able to improve the assembly with respect to the length of the contigs and can resolve more repetitive regions.
Affiliation(s)
- Alexander Seitz
  - Center for Bioinformatics (ZBIT), Integrative Transcriptomics, Eberhard-Karls-Universität Tübingen, Tübingen, Germany
- Kay Nieselt
  - Center for Bioinformatics (ZBIT), Integrative Transcriptomics, Eberhard-Karls-Universität Tübingen, Tübingen, Germany
|
89
|
Vollmers J, Wiegand S, Kaster AK. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters! PLoS One 2017; 12:e0169662. [PMID: 28099457 PMCID: PMC5242441 DOI: 10.1371/journal.pone.0169662] [Citation(s) in RCA: 122] [Impact Index Per Article: 17.4] [Received: 04/18/2016] [Accepted: 12/20/2016] [Indexed: 12/20/2022]
Abstract
With the constant improvement in cost-efficiency and quality of Next Generation Sequencing technologies, shotgun-sequencing approaches -such as metagenomics- have nowadays become the methods of choice for studying and classifying microorganisms from various habitats. The production of data has dramatically increased over the past years and processing and analysis steps are becoming more and more of a bottleneck. Limiting factors are partly the availability of computational resources, but mainly the bioinformatics expertise in establishing and applying appropriate processing and analysis pipelines. Fortunately, a large diversity of specialized software tools is nowadays available. Nevertheless, choosing the most appropriate methods for answering specific biological questions can be rather challenging, especially for non-bioinformaticians. In order to provide a comprehensive overview and guide for the microbiological scientific community, we assessed the most common and freely available metagenome assembly tools with respect to their output statistics, their sensitivity for low abundant community members and variability in resulting community profiles as well as their ease-of-use. In contrast to the highly anticipated "Critical Assessment of Metagenomic Interpretation" (CAMI) challenge, which uses general mock community-based assembler comparison we here tested assemblers on real Illumina metagenome sequencing data from natural communities of varying complexity sampled from forest soil and algal biofilms. Our observations clearly demonstrate that different assembly tools can prove optimal, depending on the sample type, available computational resources and, most importantly, the specific research goal. 
In addition, we present detailed descriptions of the underlying principles and pitfalls of publicly available assembly tools from a microbiologist's perspective, and provide guidance regarding the user-friendliness, sensitivity and reliability of the resulting phylogenetic profiles.
Affiliation(s)
- John Vollmers
  - Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
- Sandra Wiegand
  - Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
- Anne-Kristin Kaster
  - Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
|
90
|
Miga KH. The Promises and Challenges of Genomic Studies of Human Centromeres. PROGRESS IN MOLECULAR AND SUBCELLULAR BIOLOGY 2017; 56:285-304. [PMID: 28840242 DOI: 10.1007/978-3-319-58592-5_12] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]
Abstract
Human centromeres are genomic regions that act as sites of kinetochore assembly to ensure proper chromosome segregation during mitosis and meiosis. Although the biological importance of centromeres in genome stability and, ultimately, cell viability is well understood, the complete sequence content and organization of these multi-megabase-sized regions remain unknown. The lack of a high-resolution reference assembly inhibits standard bioinformatics protocols, and as a result, sequence-based studies involving human centromeres lag far behind the advances made for the non-repetitive sequences in the human genome. In this chapter, I introduce what is known about the genomic organization in the highly repetitive regions spanning human centromeres, and discuss the challenges these sequences pose for assembly, alignment, and data interpretation. Overcoming these obstacles is expected to usher in a new era for centromere genomics, which will offer new discoveries in basic cell biology and human biomedical research.
Affiliation(s)
- Karen H Miga
  - Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA, USA
|
91
|
Minio A, Lin J, Gaut BS, Cantu D. How Single Molecule Real-Time Sequencing and Haplotype Phasing Have Enabled Reference-Grade Diploid Genome Assembly of Wine Grapes. FRONTIERS IN PLANT SCIENCE 2017; 8:826. [PMID: 28567052 PMCID: PMC5434136 DOI: 10.3389/fpls.2017.00826] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 04/04/2017] [Accepted: 05/02/2017] [Indexed: 05/23/2023]
Affiliation(s)
- Andrea Minio
  - Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
- Jerry Lin
  - Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
- Brandon S. Gaut
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA, United States
- Dario Cantu
  - Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
  - *Correspondence: Dario Cantu
|
92
|
Abstract
The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions.
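For readers unfamiliar with the de Bruijn graph approach mentioned in this abstract, here is a minimal sketch of the classical construction for short, accurate reads. It is not the ABruijn generalization the abstract describes; the function name and the edge-list representation are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a classical de Bruijn graph from error-free reads.

    Nodes are (k-1)-mers; each k-mer found in a read adds one directed
    edge from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Repeated k-mers produce repeated edges (a multigraph).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping reads from the hypothetical sequence "ACGTC".
graph = de_bruijn_graph(["ACGT", "CGTC"], k=3)
```

An assembler then looks for paths through this graph that use the edges; sequencing errors in long reads create spurious k-mers, which is one reason the classical construction is usually restricted to accurate reads.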
|
93
|
Rastrojo A, Alcamí A. Aquatic viral metagenomics: Lights and shadows. Virus Res 2016; 239:87-96. [PMID: 27889617 DOI: 10.1016/j.virusres.2016.11.021] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 08/10/2016] [Accepted: 11/18/2016] [Indexed: 01/02/2023]
Abstract
Viruses are the most abundant biological entities on Earth, outnumbering bacteria in most ecosystems. Especially in oceans, viruses are thought to be the major planktonic predators shaping microorganism communities and controlling ocean biological capacity. Plankton lysis by viruses plays an important role in ocean nutrient and energy cycles. Viral metagenomics has emerged as a powerful tool to uncover viral diversity in aquatic ecosystems through the use of Next Generation Sequencing. However, many of the commonly used viral sample preparation steps have several important biases that must be considered to avoid a misinterpretation of the results. In addition to biases caused by the purification of virus particles, viral DNA/RNA amplification and the preparation of genomic libraries could also introduce biases, and a detailed knowledge of such protocols is required. In this review, the main steps in the viral metagenomic workflow are described, paying special attention to the potential biases introduced by each one.
Affiliation(s)
- Alberto Rastrojo
  - Centro de Biología Molecular Severo Ochoa (Consejo Superior de Investigaciones Científicas y Universidad Autónoma de Madrid), Madrid, Spain
- Antonio Alcamí
  - Centro de Biología Molecular Severo Ochoa (Consejo Superior de Investigaciones Científicas y Universidad Autónoma de Madrid), Madrid, Spain
|
94
|
Guizelini D, Raittz RT, Cruz LM, Souza EM, Steffens MBR, Pedrosa FO. GFinisher: a new strategy to refine and finish bacterial genome assemblies. Sci Rep 2016; 6:34963. [PMID: 27721396 PMCID: PMC5056350 DOI: 10.1038/srep34963] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 06/14/2016] [Accepted: 09/20/2016] [Indexed: 01/10/2023]
Abstract
Despite developments in DNA sequencing technology that have improved the number and the length of reads, the reconstruction of complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented into contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating a Fuzzy GC skew graph for each contig in an assembly and then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to a dataset from 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
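As background for the GC skew signal this abstract relies on, the sketch below computes the plain windowed GC skew, (G - C) / (G + C), along a sequence. This is the standard textbook statistic, not GFinisher's "Fuzzy GC skew"; the function name and the non-overlapping window scheme are illustrative assumptions.

```python
def gc_skew(seq, window=1000):
    """Return the GC skew (G - C) / (G + C) for each full, non-overlapping window.

    Windows containing no G or C are reported as 0.0; a trailing partial
    window shorter than `window` is ignored.
    """
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        values.append((g - c) / (g + c) if (g + c) else 0.0)
    return values
```

In many bacterial chromosomes the sign of the skew changes near the replication origin and terminus, so an unexpected sign change inside a contig can hint at a misjoin; how GFinisher turns that signal into break points is described in the cited paper, not here.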
Affiliation(s)
- Dieval Guizelini
- Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Roberto T Raittz
- Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Leonardo M Cruz
- Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Emanuel M Souza
- Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Maria B R Steffens
- Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Fabio O Pedrosa
- Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
|
95
|
Evolutionary trajectories of snake genes and genomes revealed by comparative analyses of five-pacer viper. Nat Commun 2016; 7:13107. [PMID: 27708285 PMCID: PMC5059746 DOI: 10.1038/ncomms13107] [Citation(s) in RCA: 68] [Impact Index Per Article: 8.5] [Received: 03/15/2016] [Accepted: 09/02/2016] [Indexed: 12/29/2022]
Abstract
Snakes have numerous features distinctive from other tetrapods and a rich history of genome evolution that is still obscure. Here, we report the high-quality genome of the five-pacer viper, Deinagkistrodon acutus, and comparative analyses with other representative snake and lizard genomes. We map the evolutionary trajectories of transposable elements (TEs), developmental genes and sex chromosomes onto the snake phylogeny. TEs exhibit dynamic lineage-specific expansion, and many viper TEs show brain-specific gene expression along with their nearby genes. We detect signatures of adaptive evolution in olfactory, venom and thermal-sensing genes and also functional degeneration of genes associated with vision and hearing. Lineage-specific relaxation of functional constraints on respective Hox and Tbx limb-patterning genes supports fossil evidence for a successive loss of forelimbs then hindlimbs during snake evolution. Finally, we infer that the ZW sex chromosome pair had undergone at least three recombination suppression events in the ancestor of advanced snakes. These results altogether forge a framework for our deep understanding into snakes' history of molecular evolution. Snakes have many characteristics that distinguish them from their relatives. Here, Yin et al. sequence the genome of the five-pacer viper, Deinagkistrodon acutus, and use comparative genomic analyses to elucidate the evolution of transposable elements, developmental genes and sex chromosomes in snakes.
|
96
|
Liu B, Liu CM, Li D, Li Y, Ting HF, Yiu SM, Luo R, Lam TW. BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics 2016; 17 Suppl 5:499. [PMID: 27586129 PMCID: PMC5009518 DOI: 10.1186/s12864-016-2829-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
Abstract
Background: De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods: This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. Results: Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. Conclusions: BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.
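The N50 statistic used for the comparison above can be computed from contig lengths alone. The sketch below uses the common definition (the length of the contig at which the running total of lengths, taken longest first, reaches half of the assembly size); the function name is an assumption, and the paper may apply additional filtering, such as a minimum contig length, that is not reproduced here.

```python
def n50(contig_lengths):
    """N50 under the common definition: the contig length at which the
    cumulative length, counted from the longest contig down, first reaches
    at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
```

For contigs of lengths 2 through 10 the total is 54, and the three longest contigs (10, 9 and 8) already sum to 27, so the N50 is 8.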
Affiliation(s)
- Binghang Liu
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Chi-Man Liu
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Dinghua Li
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Yingrui Li
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Hing-Fung Ting
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Siu-Ming Yiu
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Ruibang Luo
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Tak-Wah Lam
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
|
97
|
MetLab: An In Silico Experimental Design, Simulation and Analysis Tool for Viral Metagenomics Studies. PLoS One 2016; 11:e0160334. [PMID: 27479078 PMCID: PMC4968819 DOI: 10.1371/journal.pone.0160334] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/24/2016] [Accepted: 07/18/2016] [Indexed: 02/07/2023]
Abstract
Metagenomics, the sequence characterization of all genomes within a sample, is widely used as a virus discovery tool as well as a tool to study the viral diversity of animals. Metagenomics can be considered to have three main steps: sample collection and preparation, sequencing, and finally bioinformatics. Bioinformatic analysis of metagenomic datasets is in itself a complex process, involving few standardized methodologies, which hampers comparison of metagenomics studies between research groups. In this publication the new bioinformatics framework MetLab is presented, aimed at providing scientists with an integrated tool for experimental design and analysis of viral metagenomes. MetLab provides support in designing the metagenomics experiment by estimating the sequencing depth needed for the complete coverage of a species. This is achieved by applying a methodology to calculate the probability of coverage using an adaptation of Stevens' theorem. It also provides scientists with several pipelines aimed at simplifying the analysis of viral metagenomes, including quality control, assembly and taxonomic binning. We also implement a tool for simulating metagenomics datasets from several sequencing platforms. The overall aim is to provide virologists with an easy-to-use tool for designing, simulating and analyzing viral metagenomes. The results presented here include a benchmark against other existing software, with emphasis on detection of viruses as well as speed of applications. MetLab is packaged as comprehensive software, readily available for Linux and OSX users at https://github.com/norling/metlab.
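To make the coverage calculation concrete, here is a minimal sketch of the textbook form of Stevens' theorem: the probability that n arcs of fractional length a, placed uniformly at random on a circle, cover it completely is the sum over k from 0 to floor(1/a) of (-1)^k C(n, k) (1 - k a)^(n - 1). This is not MetLab's adaptation, whose details are not given in the abstract. The function name, the circular-genome assumption and the uniformly random read starts are all assumptions; exact rational arithmetic is used because the alternating sum is numerically unstable in floating point.

```python
from fractions import Fraction
from math import comb


def coverage_probability(n_reads, read_len, genome_len):
    """Probability that n_reads reads of read_len bases fully cover a
    circular genome of genome_len bases (textbook Stevens' theorem).

    Assumes read start positions are uniformly random on the circle.
    Hypothetical helper; NOT the adapted formula implemented in MetLab.
    """
    if n_reads < 1:
        return 0.0
    a = Fraction(read_len, genome_len)  # arc length as a fraction of the circle
    if a >= 1:
        return 1.0  # a single read already spans the whole genome
    k_max = min(n_reads, genome_len // read_len)  # floor(1/a), capped at n
    total = Fraction(0)
    for k in range(k_max + 1):
        total += (-1) ** k * comb(n_reads, k) * (1 - k * a) ** (n_reads - 1)
    return float(total)
```

As a sanity check, three reads that each span half of the circle give a probability of 1/4, the complement of the classical result that three random points fall within a common semicircle with probability 3/4.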
|
98
|
Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine. Pharmaceutics 2016; 8:E15. [PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/17/2015] [Revised: 03/11/2016] [Accepted: 04/06/2016] [Indexed: 01/15/2023]
Abstract
Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structures in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived from the DNA of multiple individuals. As a result, the reference genome is incomplete and may misrepresent the sequence variants of the general population. A more reliable solution is to compare the sequences of diseased tissue with the individual's own genome sequence derived from tissue in a normal state. As the price of sequencing a human genome has dropped dramatically to around $1000, documenting the personal genome of every individual has a promising future. However, de novo assembly of individual genomes at an affordable cost is still challenging, and to date only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructure. We recommend a combined hybrid assembly with long and short reads as a promising way to generate good-quality human genome assemblies, and we specify parameters for the quality assessment of assembly outcomes. We provide a perspective on the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the use of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy and detecting tumors. We believe precision medicine would largely benefit from bioinformatics solutions, particularly for personal genome assembly.
Affiliation(s)
- Wenming Xiao
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Leihong Wu
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Gokhan Yavas
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Vahan Simonyan
- Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, 10903 New Hampshire Ave, Silver Spring, MD 20993, USA
- Baitang Ning
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Huixiao Hong
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
|
99
|
Huang KW, Chen JL, Yang CS, Tsai CW. A memetic gravitation search algorithm for solving DNA fragment assembly problems. Journal of Intelligent & Fuzzy Systems 2016. [DOI: 10.3233/ifs-151994] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/15/2022]
Affiliation(s)
- Ko-Wei Huang
- Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Department of Psychology, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Jui-Le Chen
- Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Department of Computer Science and Entertainment Technology, Tajen University, Pingtung, Taiwan, R.O.C
- Chu-Sing Yang
- Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Chun-Wei Tsai
- Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan, R.O.C
|
100
|
Moreton J, Izquierdo A, Emes RD. Assembly, Assessment, and Availability of De novo Generated Eukaryotic Transcriptomes. Front Genet 2016; 6:361. [PMID: 26793234 PMCID: PMC4707302 DOI: 10.3389/fgene.2015.00361] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 09/20/2015] [Accepted: 12/19/2015] [Indexed: 11/13/2022]
Abstract
De novo assembly of a complete transcriptome without the need for a guiding reference genome is attractive, particularly where the cost and complexity of generating a eukaryote genome is prohibitive. The transcriptome should not however be seen as just a quick and cheap alternative to building a complete genome. Transcriptomics allows the understanding and comparison of spatial and temporal samples within an organism, and allows surveying of multiple individuals or closely related species. De novo assembly in theory allows the building of a complete transcriptome without any prior knowledge of the genome. It also allows the discovery of alternate splice forms of coding RNAs and also non-coding RNAs, which are often missed by proteomic approaches, or are incompletely annotated in genome studies. The limitations of the method are that the generation of a truly complete assembly is unlikely, and so we require some methods for the assessment of the quality and appropriateness of a generated transcriptome. Whilst no single consensus pipeline or tool is agreed as optimal, various algorithms, and easy to use software do exist making transcriptome generation a more common approach. With this expansion of data, questions still exist relating to how do we make these datasets fully discoverable, comparable and most useful to understand complex biological systems?
Affiliation(s)
- Joanna Moreton
- Advanced Data Analysis Centre, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
- School of Veterinary Medicine and Science, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
- Abril Izquierdo
- School of Veterinary Medicine and Science, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
- Richard D. Emes
- Advanced Data Analysis Centre, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
- School of Veterinary Medicine and Science, Sutton Bonington Campus, University of Nottingham, Leicestershire, UK
|