1
Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, Sidhu P, Warren RL, Birol I. Linear time complexity de novo long read genome assembly with GoldRush. Nat Commun 2023; 14:2906. [PMID: 37217507 PMCID: PMC10202940 DOI: 10.1038/s41467-023-38716-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/10/2022] [Accepted: 05/11/2023] [Indexed: 05/24/2023]
Abstract
Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap - its most costly step - was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
Affiliation(s)
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Vladimir Nikolić
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Emily Zhang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Puneet Sidhu
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
2
Naranjo-Ortiz MA, Molina M, Fuentes D, Mixão V, Gabaldón T. Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other nonstandard architectures in genome assemblies. Gigascience 2022; 11:6751106. [PMID: 36205401 PMCID: PMC9540331 DOI: 10.1093/gigascience/giac088] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/22/2021] [Revised: 11/23/2021] [Accepted: 08/24/2022] [Indexed: 12/22/2022]
Abstract
BACKGROUND: Recent technological developments have made genome sequencing and assembly highly accessible and widely used. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, heterokaryosis, or extreme compositional biases can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a nonnegligible fraction of low-quality assemblies that result from these types of intrinsic genomic factors. FINDINGS: Here we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of noncanonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species. CONCLUSIONS: Our results show that 10 (28.57%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least for some lineages within the Fungi.
Affiliation(s)
- Miguel A Naranjo-Ortiz
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain
  - Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain
  - Biology Department, Clark University, Worcester, MA 01610, USA
  - Naturhistoriskmuseum, University of Oslo, Oslo 0562, Norway
- Manu Molina
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain
  - Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain
  - Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain
- Diego Fuentes
  - Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain
  - Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Verónica Mixão
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain
  - Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain
  - Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain
  - Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Toni Gabaldón
  - Correspondence address. Toni Gabaldón, Plaça Eusebi Güell, 1-3, Barcelona 08034, Spain.
3
Gupta AK, Kumar M. Benchmarking and Assessment of Eight De Novo Genome Assemblers on Viral Next-Generation Sequencing Data, Including the SARS-CoV-2. OMICS 2022; 26:372-381. [PMID: 35759429 DOI: 10.1089/omi.2022.0042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
Viral genomics has become crucial in clinical diagnostics and ecology, not to mention to stem the COVID-19 pandemic. Whole-genome sequencing (WGS) is pivotal in gaining an improved understanding of viral evolution, genomic epidemiology, infectious outbreaks, pathobiology, clinical management, and vaccine development. Genome assembly is one of the crucial steps in WGS data analyses. A series of different assemblers has been developed with the advent of high-throughput next-generation sequencing (NGS). Various studies have reported the evaluation of these assembly tools on distinct datasets; however, these lack data from viral origin. In this study, we performed a comparative evaluation and benchmarking of eight de novo assemblers: SOAPdenovo, Velvet, assembly by short sequences (ABySS), iterative De Bruijn graph assembler (IDBA), SPAdes, Edena, iterative virus assembler, and VICUNA on the viral NGS data from distinct Illumina (GAIIx, Hiseq, Miseq, and Nextseq) platforms. WGS data of diverse viruses, that is, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), dengue virus 3, human immunodeficiency virus 1, hepatitis B virus, human herpesvirus 8, human papillomavirus 16, rhinovirus A, and West Nile virus, were utilized to assess these assemblers. Performance metrics such as genome fraction recovery, assembly lengths, NG50, N50, contig length, contig numbers, mismatches, and misassemblies were analyzed. Overall, three assemblers, that is, SPAdes, IDBA, and ABySS, performed consistently well, including for genome assembly of SARS-CoV-2. These assembly methods should be considered and recommended for future studies of viruses. The study also suggests that implementing two or more assembly approaches should be considered in viral NGS studies, especially in clinical settings. 
Taken together, the benchmarking of eight de novo genome assemblers reported in this study can inform future public health and ecology research concerning the viruses, the COVID-19 pandemic, and viral outbreaks.
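Two of the contiguity metrics named in this abstract, N50 and NG50, can be illustrated with a short sketch. This is not code from the cited study: the function name `nx50` and the toy contig lengths are assumptions made for illustration. N50 is the length of the contig at which the running total of contigs, sorted longest first, reaches half of the assembly size; NG50 uses half of the (estimated) genome size instead.

```python
def nx50(lengths, genome_size=None):
    """Return N50 of `lengths`, or NG50 when `genome_size` is given.

    `nx50` is a hypothetical helper name, not part of any assembler or
    evaluation tool mentioned in the abstract.
    """
    # Half of the assembly size (N50) or half of the genome size (NG50).
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    # Walk contigs from longest to shortest until half of `total` is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the contigs cover less than half of `genome_size`


# Toy contig lengths (assumed values, sum = 300).
contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx50(contigs)                     # half of 300 is reached at the 70-long contig
ng50 = nx50(contigs, genome_size=400)   # half of 400 is reached at the 50-long contig
print(n50, ng50)
```

Because NG50 is normalized by genome size rather than assembly size, it stays comparable across assemblers that recover different fractions of the same genome.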
Affiliation(s)
- Amit Kumar Gupta
  - Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Chandigarh, India
- Manoj Kumar
  - Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Chandigarh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
4
Music of metagenomics - a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
5
Nisar H, Wajid B, Shahid S, Anwar F, Wajid I, Khatoon A, Sattar MU, Sadaf S. Whole-genome sequencing as a first-tier diagnostic framework for rare genetic diseases. Exp Biol Med (Maywood) 2021; 246:2610-2617. [PMID: 34521224 DOI: 10.1177/15353702211040046] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/17/2022]
Abstract
Rare diseases affect nearly 300 million people globally with most patients aged five or less. Traditional diagnostic approaches have provided much of the diagnosis; however, there are limitations. For instance, simply inadequate and untimely diagnosis adversely affects both the patient and their families. This review advocates the use of whole genome sequencing in clinical settings for diagnosis of rare genetic diseases by showcasing five case studies. These examples specifically describe the utilization of whole genome sequencing, which helped in providing relief to patients via correct diagnosis followed by use of precision medicine.
Affiliation(s)
- Haseeb Nisar
  - Office of Research, Innovation and Commercialization, University of Management and Technology, Lahore 54000, Pakistan
  - School of Biochemistry & Biotechnology, University of the Punjab, Lahore 54000, Pakistan
- Bilal Wajid
  - Department of Electrical Engineering, University of Engineering and Technology, Lahore 54000, Pakistan
  - Ibn Sina Research & Development Division, Sabz-Qalam, Lahore 54000, Pakistan
  - Department of Computer Sciences, University of Management and Technology, Lahore 54000, Pakistan
- Samiah Shahid
  - Institute of Molecular Biology and Biotechnology, The University of Lahore, Lahore 54000, Pakistan
- Faria Anwar
  - Out Patient Department, Mayo Hospital, Lahore 54000, Pakistan
- Imran Wajid
  - Ibn Sina Research & Development Division, Sabz-Qalam, Lahore 54000, Pakistan
- Asia Khatoon
  - School of Biochemistry & Biotechnology, University of the Punjab, Lahore 54000, Pakistan
- Mian Usman Sattar
  - Institute of Social Sciences, Istanbul Commerce University, Istanbul, Turkey
- Saima Sadaf
  - School of Biochemistry & Biotechnology, University of the Punjab, Lahore 54000, Pakistan
6
Dida F, Yi G. Empirical evaluation of methods for de novo genome assembly. PeerJ Comput Sci 2021; 7:e636. [PMID: 34307867 PMCID: PMC8279138 DOI: 10.7717/peerj-cs.636] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/04/2021] [Accepted: 06/19/2021] [Indexed: 06/12/2023]
Abstract
Technologies for next-generation sequencing (NGS) have stimulated an exponential rise in high-throughput sequencing projects and resulted in the development of new read-assembly algorithms. A drastic reduction in the costs of generating short reads on the genomes of new organisms is attributable to recent advances in NGS technologies such as Ion Torrent, Illumina, and PacBio. Genome research has led to the creation of high-quality reference genomes for several organisms, and de novo assembly is a key initiative that has facilitated gene discovery and other studies. More powerful analytical algorithms are needed to work on the increasing amount of sequence data. We make a thorough comparison of the de novo assembly algorithms to allow new users to clearly understand the assembly algorithms: overlap-layout-consensus and de-Bruijn-graph, string-graph based assembly, and hybrid approach. We also address the computational efficacy of each algorithm's performance, challenges faced by the assembly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers and other related assembly differences with and without the reference genome. We hope that this analysis will contribute to further the application of de novo sequences and help the future growth of assembly algorithms.
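As a rough illustration of the de Bruijn graph approach named in this abstract, the sketch below builds the graph from a pair of toy reads. It is a minimal teaching example under stated assumptions, not the method of any assembler evaluated in the cited paper: the function name `de_bruijn`, the reads, and the choice of k = 3 are all hypothetical. Each k-mer in a read contributes one edge from its (k-1)-length prefix to its (k-1)-length suffix.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn multigraph as {prefix: [suffix, ...]}.

    Nodes are (k-1)-mers; every k-mer occurrence in a read adds one edge
    from its prefix to its suffix. `de_bruijn` is a hypothetical helper.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (assumed data) drawn from the sequence ACGTA.
reads = ["ACGT", "CGTA"]
graph = de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

The duplicated edge out of `CG` records that the k-mer `CGT` was seen in both reads; real assemblers typically use such multiplicities as coverage evidence and then look for paths through the graph, which this sketch does not attempt.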
Affiliation(s)
- Firaol Dida
  - Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
- Gangman Yi
  - Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
7
Gabbassov E, Moreno-Molina M, Comas I, Libbrecht M, Chindelevitch L. SplitStrains, a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGS data. Microb Genom 2021; 7. [PMID: 34165419 PMCID: PMC8461467 DOI: 10.1099/mgen.0.000607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Abstract
The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. difficile within a single human host, referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detecting it, and especially determining the proportion and identities of the underlying strains, from WGS (whole-genome sequencing) data, have been limited. In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded in a rigorous statistical model, SplitStrains not only demonstrates superior performance in proportion estimation to other existing methods on both simulated as well as real M. tuberculosis data, but also successfully determines the identity of the underlying strains. We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for data coming from bacterial pathogens and holds the promise of enabling previously inaccessible conclusions to be drawn in the realm of public health microbiology.
Affiliation(s)
- Einar Gabbassov
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  - Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
  - *Correspondence: Einar Gabbassov
- Iñaki Comas
  - Instituto de Biomedicina de Valencia, Valencia, Spain
- Maxwell Libbrecht
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Leonid Chindelevitch
  - MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College, London, UK
  - *Correspondence: Leonid Chindelevitch
8
Collins JH, Keating KW, Jones TR, Balaji S, Marsan CB, Çomo M, Newlon ZJ, Mitchell T, Bartley B, Adler A, Roehner N, Young EM. Engineered yeast genomes accurately assembled from pure and mixed samples. Nat Commun 2021; 12:1485. [PMID: 33674578 PMCID: PMC7935868 DOI: 10.1038/s41467-021-21656-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/21/2020] [Accepted: 02/04/2021] [Indexed: 01/31/2023]
Abstract
Yeast whole genome sequencing (WGS) lacks end-to-end workflows that identify genetic engineering. Here we present Prymetime, a tool that assembles yeast plasmids and chromosomes and annotates genetic engineering sequences. It is a hybrid workflow: it uses short and long reads as inputs to perform separate linear and circular assembly steps. This structure is necessary to accurately resolve genetic engineering sequences in plasmids and the genome. We show this by assembling diverse engineered yeasts, in some cases revealing unintended deletions and integrations. Furthermore, the resulting whole genomes are high quality, although the underlying assembly software does not consistently resolve highly repetitive genome features. Finally, we assemble plasmids and genome integrations from metagenomic sequencing, even with 1 engineered cell in 1000. This work is a blueprint for building WGS workflows and establishes WGS-based identification of yeast genetic engineering.
Affiliation(s)
- Joseph H Collins
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Kevin W Keating
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Trent R Jones
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Shravani Balaji
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Celeste B Marsan
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Marina Çomo
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Zachary J Newlon
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
- Tom Mitchell
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Bryan Bartley
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Aaron Adler
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Nicholas Roehner
  - Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
- Eric M Young
  - Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
9
10
Muggia L, Ametrano CG, Sterflinger K, Tesei D. An Overview of Genomics, Phylogenomics and Proteomics Approaches in Ascomycota. Life (Basel) 2020; 10:E356. [PMID: 33348904 PMCID: PMC7765829 DOI: 10.3390/life10120356] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/12/2020] [Revised: 12/10/2020] [Accepted: 12/12/2020] [Indexed: 12/26/2022]
Abstract
Fungi are among the most successful eukaryotes on Earth: they have evolved strategies to survive in the most diverse environments and stressful conditions and have been selected and exploited for multiple aims by humans. The characteristic features intrinsic to Fungi have required evolutionary changes and adaptations at deep molecular levels. Omics approaches, nowadays including genomics, metagenomics, phylogenomics, transcriptomics, metabolomics, and proteomics, have enormously advanced the understanding of fungal diversity at diverse taxonomic levels, under changeable conditions and in still under-investigated environments. These approaches can be applied both to environmental communities and to individual organisms, either in nature or in axenic culture, and have led traditional morphology-based fungal systematics to increasingly implement molecular-based approaches. The advent of next-generation sequencing technologies was key to boosting advances in fungal genomics and proteomics research. Much effort has also been directed towards the development of methodologies for optimal genomic DNA and protein extraction and separation. To date, the number of proteomics investigations in Ascomycetes exceeds that carried out in any other fungal group. This is primarily due to the preponderance of their involvement in plant and animal diseases and multiple industrial applications, and therefore the need to understand the biological basis of the infectious process to develop mechanisms for biologic control, as well as to detect key proteins with roles in stress survival. Here we chose to present as comprehensive an overview as possible of the major advances, mainly of the past decade, in the fields of genomics (including phylogenomics) and proteomics of Ascomycota, focusing particularly on those reporting on opportunistic pathogenic, extremophilic, polyextremotolerant and lichenized fungi. We also present a review of the most widely used genome sequencing technologies and methods for DNA sequence and protein analyses applied so far to fungi.
Affiliation(s)
- Lucia Muggia
  - Department of Life Sciences, University of Trieste, 34127 Trieste, Italy
- Claudio G. Ametrano
  - Grainger Bioinformatics Center, Department of Science and Education, The Field Museum, Chicago, IL 60605, USA
- Katja Sterflinger
  - Academy of Fine Arts Vienna, Institute of Natural Sciences and Technology in the Arts, 1090 Vienna, Austria
- Donatella Tesei
  - Department of Biotechnology, University of Natural Resources and Life Sciences, 1190 Vienna, Austria
11
Medvedev P. Modeling biological problems in computer science: a case study in genome assembly. Brief Bioinform 2020; 20:1376-1383. [PMID: 29394324 DOI: 10.1093/bib/bby003] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/01/2017] [Revised: 12/07/2017] [Indexed: 11/14/2022]
Abstract
As computer scientists working in bioinformatics/computational biology, we often face the challenge of coming up with an algorithm to answer a biological question. This occurs in many areas, such as variant calling, alignment and assembly. In this tutorial, we use the example of the genome assembly problem to demonstrate how to go from a question in the biological realm to a solution in the computer science realm. We show the modeling process step-by-step, including all the intermediate failed attempts. Please note this is not an introduction to how genome assembly algorithms work and, if treated as such, would be incomplete and unnecessarily long-winded.
12
13
Li F, Zhao X, Li M, He K, Huang C, Zhou Y, Li Z, Walters JR. Insect genomes: progress and challenges. Insect Mol Biol 2019; 28:739-758. [PMID: 31120160 DOI: 10.1111/imb.12599] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 11/02/2017] [Revised: 03/22/2019] [Accepted: 05/14/2019] [Indexed: 05/24/2023]
Abstract
In the wake of constant improvements in sequencing technologies, numerous insect genomes have been sequenced. Currently, 1219 insect genome-sequencing projects have been registered with the National Center for Biotechnology Information, including 401 that have genome assemblies and 155 with an official gene set of annotated protein-coding genes. Comparative genomics analysis showed that the expansion or contraction of gene families was associated with well-studied physiological traits such as immune system, metabolic detoxification, parasitism and polyphagy in insects. Here, we summarize the progress of insect genome sequencing, with an emphasis on how this impacts research on pest control. We begin with a brief introduction to the basic concepts of genome assembly, annotation and metrics for evaluating the quality of draft assemblies. We then provide an overview of genome information for numerous insect species, highlighting examples from prominent model organisms, agricultural pests and disease vectors. We also introduce the major insect genome databases. The increasing availability of insect genomic resources is beneficial for developing alternative pest control methods. However, many opportunities remain for developing data-mining tools that make maximal use of the available insect genome resources. Although rapid progress has been achieved, many challenges remain in the field of insect genomics.
Affiliation(s)
- F Li
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- X Zhao
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- M Li
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- K He
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- C Huang
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- Y Zhou
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- Z Li
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- J R Walters
  - Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA
14
Hassanzadeh HR, Sha Y, Wang MD. DeepDeath: Learning to predict the underlying cause of death with Big Data. Annu Int Conf IEEE Eng Med Biol Soc 2017; 2017:3373-3376. [PMID: 29060620 PMCID: PMC7324297 DOI: 10.1109/embc.2017.8037579] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Multiple cause-of-death data provides a valuable source of information that can be used to enhance health standards by predicting health related trajectories in societies with large populations. These data are often available in large quantities across U.S. states and require Big Data techniques to uncover complex hidden patterns. We design two different classes of models suitable for large-scale analysis of mortality data, a Hadoop-based ensemble of random forests trained over N-grams, and the DeepDeath, a deep classifier based on the recurrent neural network (RNN). We apply both classes to the mortality data provided by the National Center for Health Statistics and show that while both perform significantly better than the random classifier, the deep model that utilizes long short-term memory networks (LSTMs), surpasses the N-gram based models and is capable of learning the temporal aspect of the data without a need for building ad-hoc, expert-driven features.
15
The A, C, G, and T of Genome Assembly. Biomed Res Int 2016; 2016:6329217. [PMID: 27247941 PMCID: PMC4877455 DOI: 10.1155/2016/6329217] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/28/2015] [Accepted: 12/22/2015] [Indexed: 11/18/2022]
Abstract
Genome assembly in its two decades of history has produced significant research, in terms of both biotechnology and computational biology. This contribution delineates sequencing platforms and their characteristics, examines key steps involved in filtering and processing raw data, explains assembly frameworks, and discusses quality statistics for the assessment of the assembled sequence. Furthermore, the paper explores recent Ubuntu-based software environments oriented towards genome assembly as well as some avenues for future research.
16
Alves JMP, de Oliveira AL, Sandberg TOM, Moreno-Gallego JL, de Toledo MAF, de Moura EMM, Oliveira LS, Durham AM, Mehnert DU, Zanotto PMDA, Reyes A, Gruber A. GenSeed-HMM: A Tool for Progressive Assembly Using Profile HMMs as Seeds and its Application in Alpavirinae Viral Discovery from Metagenomic Data. Front Microbiol 2016; 7:269. [PMID: 26973638 PMCID: PMC4777721 DOI: 10.3389/fmicb.2016.00269] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 10/29/2015] [Accepted: 02/19/2016] [Indexed: 01/01/2023]
Abstract
This work reports the development of GenSeed-HMM, a program that implements seed-driven progressive assembly, an approach to reconstruct specific sequences from unassembled data, starting from short nucleotide or protein seed sequences or profile Hidden Markov Models (HMM). The program can use any one of a number of sequence assemblers. Assembly is performed in multiple steps and relatively few reads are used in each cycle, consequently the program demands low computational resources. As a proof-of-concept and to demonstrate the power of HMM-driven progressive assemblies, GenSeed-HMM was applied to metagenomic datasets in the search for diverse ssDNA bacteriophages from the recently described Alpavirinae subfamily. Profile HMMs were built using Alpavirinae-specific regions from multiple sequence alignments (MSA) using either the viral protein 1 (VP1; major capsid protein) or VP4 (genome replication initiation protein). These profile HMMs were used by GenSeed-HMM (running Newbler assembler) as seeds to reconstruct viral genomes from sequencing datasets of human fecal samples. All contigs obtained were annotated and taxonomically classified using similarity searches and phylogenetic analyses. The most specific profile HMM seed enabled the reconstruction of 45 partial or complete Alpavirinae genomic sequences. A comparison with conventional (global) assembly of the same original dataset, using Newbler in a standalone execution, revealed that GenSeed-HMM outperformed global genomic assembly in several metrics employed. This approach is capable of detecting organisms that have not been used in the construction of the profile HMM, which opens up the possibility of diagnosing novel viruses, without previous specific information, constituting a de novo diagnosis. Additional applications include, but are not limited to, the specific assembly of extrachromosomal elements such as plastid and mitochondrial genomes from metagenomic data. 
Profile HMM seeds can also be used to reconstruct specific protein coding genes for gene diversity studies, and to determine all possible gene variants present in a metagenomic sample. Such surveys could be useful to detect the emergence of drug-resistance variants in sensitive environments such as hospitals and animal production facilities, where antibiotics are regularly used. Finally, GenSeed-HMM can be used as an adjunct for gap closure on assembly finishing projects, by using multiple contig ends as anchored seeds.
Affiliation(s)
- João M P Alves
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- André L de Oliveira
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Tatiana O M Sandberg
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Marcelo A F de Toledo
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Elisabeth M M de Moura
- Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Liliane S Oliveira
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil; Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
- Alan M Durham
- Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
- Dolores U Mehnert
- Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Paolo M de A Zanotto
- Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Alejandro Reyes
- Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia; Center for Genome Sciences and Systems Biology, Department of Pathology and Immunology, Washington University in Saint Louis, MO, USA
- Arthur Gruber
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
|
17
|
Orsini M, Cuccuru G, Uva P, Fotia G. Bacterial Genomic Data Analysis in the Next-Generation Sequencing Era. Methods Mol Biol 2016; 1415:407-422. [PMID: 27115645 DOI: 10.1007/978-1-4939-3572-7_21] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Bacterial genome sequencing is now an affordable choice for many laboratories for applications in research, diagnostic, and clinical microbiology. Nowadays, an overabundance of tools is available for genomic data analysis. However, tools differ in algorithms, languages, hardware requirements, and user interface, and combining them, as is necessary for sequence data interpretation, often requires (bio)informatics skills which can be difficult to find in many laboratories. In addition, multiple data sources, exceedingly large dataset sizes, and increasing computational complexity further challenge the accessibility, reproducibility, and transparency of the entire process. In this chapter we will cover the main bioinformatics steps required for a complete bacterial genome analysis using next-generation sequencing data, from the raw sequence data to assembled and annotated genomes. All the tools described are available in the Orione framework (http://orione.crs4.it), which uniquely combines in a transparent way the most used open source bioinformatics tools for microbiology, allowing microbiologists without any specific hardware or informatics skills to conduct data-intensive computational analyses from quality control to microbial gene annotation.
Affiliation(s)
- Massimiliano Orsini
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
- Gianmauro Cuccuru
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
- Paolo Uva
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
- Giorgio Fotia
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
|
18
|
Cunha MLR, Meijers JCM, Middeldorp S. Introduction to the analysis of next generation sequencing data and its application to venous thromboembolism. Thromb Haemost 2015; 114:920-32. [PMID: 26446408 DOI: 10.1160/th15-05-0411] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 05/15/2015] [Accepted: 08/26/2015] [Indexed: 12/13/2022]
Abstract
Despite knowledge of various inherited risk factors associated with venous thromboembolism (VTE), no definite cause can be found in about 50% of patients. The application of data-driven searches such as GWAS has not been able to identify genetic variants with implications for clinical care, and unexplained heritability remains. In recent years, the development of several so-called next generation sequencing (NGS) platforms has offered the possibility of generating fast, inexpensive and accurate genomic information. However, so far their application to VTE has been very limited. Here we review basic concepts of NGS data analysis and explore the application of NGS technology to VTE. We provide both computational and biological viewpoints to discuss potentials and challenges of NGS-based studies.
Affiliation(s)
- Marisa L R Cunha
- Marisa L. R. Cunha, Department of Experimental Vascular Medicine, Academic Medical Center, Meibergdreef 9, 1105 AZ Amsterdam, The Netherlands, Tel.: +31 20 5662824, Fax: +31 20 6968833, E-mail:
|
19
|
Abstract
Bioinformatics skills required for genome sequencing often represent a significant hurdle for many researchers working in computational biology. This humble effort highlights the significance of genome assembly as a research area, focuses on its need to remain accurate, provides details about the characteristics of the raw data, examines some key metrics, emphasizes some tools and draws attention to a generic tutorial with example data that outlines the whole pipeline for next-generation sequencing. The article concludes by pointing out some major future research problems.
|
20
|
Huang H, Dong Y, Yang ZL, Luo H, Zhang X, Gao F. Complete sequence of pABTJ2, a plasmid from Acinetobacter baumannii MDR-TJ, carrying many phage-like elements. GENOMICS PROTEOMICS & BIOINFORMATICS 2014; 12:172-7. [PMID: 25046542 PMCID: PMC4411360 DOI: 10.1016/j.gpb.2014.05.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 03/21/2014] [Revised: 05/14/2014] [Accepted: 05/26/2014] [Indexed: 12/24/2022]
Abstract
Acinetobacter baumannii is an important opportunistic pathogen in hospitals, and multidrug-resistant isolates of A. baumannii have been increasingly reported in recent years. A number of different mechanisms of resistance have been reported, some of which are associated with plasmid-mediated acquisition of genes. Therefore, studies on plasmids in A. baumannii have been a hot issue lately. We have performed complete genome sequencing of A. baumannii MDR-TJ, which is a multidrug-resistant isolate. Finalizing the remaining large scaffold of the previous assembly, we found a new plasmid, pABTJ2, which carries many phage-like elements. The plasmid pABTJ2 is a circular double-stranded DNA molecule, 110,967 bp in length. We annotated 125 CDSs from pABTJ2 using IMG ER and ZCURVE_V, accounting for 88.28% of the whole plasmid sequence. Many phage-like elements and a tRNA-coding gene were detected in pABTJ2, which is rarely reported among A. baumannii. The tRNA gene is specific for asparagine codon GTT, which may be a small chromosomal sequence picked up through incorrect excision during plasmid formation. The phage-like elements may have been acquired during the integration process, as the GC content of the region carrying phage-like elements was higher than that of the adjacent regions. The finding of phage-like elements and a tRNA-coding gene in pABTJ2 may provide a novel insight into the study of the A. baumannii pan-plasmidome.
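The GC-content comparison between a region and its neighbours, mentioned in this abstract, can be sketched as a sliding-window calculation. The function names, window size, and step below are illustrative assumptions and are not taken from the paper.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step):
    """GC content of each full window, as (start_position, gc_fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: an AT-rich flank followed by a GC-rich block.
    for start, gc in windowed_gc("AAAATTTTGGGGCCCC", window=8, step=8):
        print(start, gc)
```

A region whose windows show consistently higher GC than the flanking windows is the kind of signal the authors interpret as a possible sign of acquired sequence.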
Affiliation(s)
- He Huang
- MOE Key Laboratory of Systems Bioengineering, Department of Biochemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Yan Dong
- MOE Key Laboratory of Systems Bioengineering, Department of Biochemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Zhi-Liang Yang
- MOE Key Laboratory of Systems Bioengineering, Department of Biochemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Hao Luo
- Department of Physics, School of Science, Tianjin University, Tianjin 300072, China
- Xi Zhang
- Department of Physics, School of Science, Tianjin University, Tianjin 300072, China
- Feng Gao
- Department of Physics, School of Science, Tianjin University, Tianjin 300072, China; Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
|
21
|
Hirakawa H, Shirasawa K, Kosugi S, Tashiro K, Nakayama S, Yamada M, Kohara M, Watanabe A, Kishida Y, Fujishiro T, Tsuruoka H, Minami C, Sasamoto S, Kato M, Nanri K, Komaki A, Yanagi T, Guoxin Q, Maeda F, Ishikawa M, Kuhara S, Sato S, Tabata S, Isobe SN. Dissection of the octoploid strawberry genome by deep sequencing of the genomes of Fragaria species. DNA Res 2013; 21:169-81. [PMID: 24282021 PMCID: PMC3989489 DOI: 10.1093/dnares/dst049] [Citation(s) in RCA: 130] [Impact Index Per Article: 11.8] [Indexed: 11/14/2022] Open
Abstract
Cultivated strawberry (Fragaria x ananassa) is octoploid and shows allogamous behaviour. The present study aims at dissecting this octoploid genome through comparison with its wild relatives, F. iinumae, F. nipponica, F. nubicola, and F. orientalis, by de novo whole-genome sequencing on Illumina and Roche 454 platforms. The total length of the assembled Illumina genome sequences obtained was 698 Mb for F. x ananassa, and ∼200 Mb each for the four wild species. Subsequently, a virtual reference genome termed FANhybrid_r1.2 was constructed by integrating the sequences of the four homoeologous subgenomes of F. x ananassa, from which heterozygous regions in the Roche 454 and Illumina genome sequences were eliminated. The total length of FANhybrid_r1.2 thus created was 173.2 Mb with an N50 length of 5137 bp. The Illumina-assembled genome sequences of F. x ananassa and the four wild species were then mapped onto the reference genome, along with the previously published F. vesca genome sequence, to establish the subgenomic structure of F. x ananassa. The strategy adopted in this study has turned out to be successful in dissecting the genome of octoploid F. x ananassa and appears promising when applied to the analysis of other polyploid plant species.
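The N50 length quoted in this abstract is a standard contiguity metric: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the usual calculation follows; the function name and the toy lengths are illustrative and unrelated to the strawberry data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    # Total is 20; the two longest sequences (6 + 5 = 11) already cover half of it.
    print(n50([2, 3, 4, 5, 6]))
```

Because the metric depends only on the length distribution, a short N50 relative to the total assembly length indicates a fragmented assembly.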
Affiliation(s)
- Hideki Hirakawa
- 1 Kazusa DNA Research Institute, Kazusa-Kamatari 2-6-7, Kisarazu, Chiba 292-0818, Japan
|
22
|
Kenny NJ, Quah S, Holland PWH, Tobe SS, Hui JHL. How are comparative genomics and the study of microRNAs changing our views on arthropod endocrinology and adaptations to the environment? Gen Comp Endocrinol 2013; 188:16-22. [PMID: 23480873 DOI: 10.1016/j.ygcen.2013.02.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 12/17/2012] [Accepted: 02/09/2013] [Indexed: 01/01/2023]
Abstract
As the last few decades of work have shown, precise regulation of biosynthesis and release of arthropod hormones is essential to cope with environmental stresses and challenges. In crustaceans and insects, the sesquiterpenoids methyl farnesoate (MF), farnesoic acid (FA) and juvenile hormone (JH) regulate many developmental, physiological, and reproductive processes. In this review, we discuss how comparative genomics has impacted and will impact our views on arthropod endocrinology. We will also highlight the current knowledge of regulation of genes involved in arthropod hormone biosynthesis by microRNAs, and describe the potential insights into arthropod endocrinology, evolution, and adaptation that are likely to come from the study of microRNAs.
Affiliation(s)
- Nathan J Kenny
- Department of Zoology, University of Oxford, South Parks Road, OX1 3PS, UK
|
23
|
Wajid B, Serpedin E, Nounou M, Nounou H. Optimal reference sequence selection for genome assembly using minimum description length principle. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2012. [PMID: 23186305 PMCID: PMC3608252 DOI: 10.1186/1687-4153-2012-18] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/05/2022]
Abstract
Reference-assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome counts the number of reads that align to each reference sequence and then chooses the reference sequence with the highest number of aligned reads. This article explores the use of the minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the proposed MDL-based scheme with the standard method, concluding that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. Therefore, the proposed MDL scheme includes within itself the standard method of “counting the number of reads that align to the reference sequence” and also goes further by looking at the model, the reference sequence, as well, in identifying the optimal reference sequence. The proposed MDL-based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.
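The "standard method" that this abstract uses as its baseline (count the reads that align to each candidate reference and keep the candidate with the highest count) can be sketched as follows. Exact substring matching stands in for a real read aligner here, and the function names and toy sequences are assumptions for illustration; the MDL scheme itself is not reproduced.

```python
def count_aligned_reads(reference, reads):
    """Count reads that occur exactly in the reference (a stand-in for alignment)."""
    return sum(1 for read in reads if read in reference)


def pick_reference_by_read_count(references, reads):
    """Return (name, count) for the candidate reference with the most matching reads."""
    counts = {
        name: count_aligned_reads(seq, reads)
        for name, seq in references.items()
    }
    best_name = max(counts, key=counts.get)
    return best_name, counts[best_name]


if __name__ == "__main__":
    candidates = {"refA": "ACGTACGTTT", "refB": "GGGGCCCCAA"}
    reads = ["ACGT", "CGTT", "CCCC", "GTAC"]
    print(pick_reference_by_read_count(candidates, reads))
```

This baseline scores only how well the reads fit a reference and ignores the reference itself, which is the gap the abstract says the MDL-based scheme addresses.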
Affiliation(s)
- Bilal Wajid
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA.
|
24
|
Review of general algorithmic features for genome assemblers for next generation sequencers. GENOMICS PROTEOMICS & BIOINFORMATICS 2012; 10:58-73. [PMID: 22768980 PMCID: PMC5054208 DOI: 10.1016/j.gpb.2012.05.006] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 01/05/2011] [Accepted: 10/26/2011] [Indexed: 01/09/2023]
Abstract
In the realm of bioinformatics and computational biology, the most rudimentary data upon which all analysis is built is the sequence data of genes, proteins and RNA. The sequence data of the entire genome is the solution to the genome assembly problem. The scope of this contribution is to provide an overview of the art of problem-solving applied within the domain of genome assembly for next-generation sequencing (NGS) platforms. This article discusses the major genome assemblers that were proposed in the literature during the past decade by outlining their basic working principles. It is intended to act as a qualitative, not a quantitative, tutorial for all working on genome assemblers pertaining to the next generation of sequencers. We discuss the theoretical aspects of various genome assemblers, identifying their working schemes. We also briefly discuss the direction in which the area is headed, along with core issues of software simplicity.
|