1
Jackson DJ, Cerveau N, Posnien N. De novo assembly of transcriptomes and differential gene expression analysis using short-read data from emerging model organisms - a brief guide. Front Zool 2024; 21:17. [PMID: 38902827 PMCID: PMC11188175 DOI: 10.1186/s12983-024-00538-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 06/12/2024] [Indexed: 06/22/2024] Open
Abstract
Many questions in biology benefit greatly from the use of a variety of model systems. High-throughput sequencing methods have been a triumph in the democratization of diverse model systems. They allow for the economical sequencing of an entire genome or transcriptome of interest, and with technical variations can even provide insight into genome organization and the expression and regulation of genes. The analysis and biological interpretation of such large datasets can present significant challenges that depend on the 'scientific status' of the model system. While high-quality genome and transcriptome references are readily available for well-established model systems, the establishment of such references for an emerging model system often requires extensive resources such as finances, expertise and computational capabilities. The de novo assembly of a transcriptome represents an excellent entry point for genetic and molecular studies in emerging model systems as it can efficiently assess gene content while also serving as a reference for differential gene expression studies. However, the process of de novo transcriptome assembly is non-trivial, and as a rule must be empirically optimized for every dataset. For the researcher working with an emerging model system, and with little to no experience with assembling and quantifying short-read data from the Illumina platform, these processes can be daunting. In this guide we outline the major challenges faced when establishing a reference transcriptome de novo and we provide advice on how to approach such an endeavor. We describe the major experimental and bioinformatic steps, and provide some broad recommendations and cautions for the newcomer to de novo transcriptome assembly and differential gene expression analyses. Moreover, we provide an initial selection of tools that can assist in the journey from raw short-read data to assembled transcriptome and lists of differentially expressed genes.
Affiliation(s)
- Daniel J Jackson
  - University of Göttingen, Department of Geobiology, Goldschmidtstr.3, Göttingen, 37077, Germany
- Nicolas Cerveau
  - University of Göttingen, Department of Geobiology, Goldschmidtstr.3, Göttingen, 37077, Germany
- Nico Posnien
  - University of Göttingen, Department of Developmental Biology, GZMB, Justus-Von-Liebig-Weg 11, Göttingen, 37077, Germany
2
Hartmann M, Herzog C, Brunner I, Stierli B, Meyer F, Buchmann N, Frey B. Long-term mitigation of drought changes the functional potential and life-strategies of the forest soil microbiome involved in organic matter decomposition. Front Microbiol 2023; 14:1267270. [PMID: 37840720 PMCID: PMC10570739 DOI: 10.3389/fmicb.2023.1267270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 09/14/2023] [Indexed: 10/17/2023] Open
Abstract
Climate change can alter the flow of nutrients and energy through terrestrial ecosystems. Using an inverse climate change field experiment in the central European Alps, we explored how long-term irrigation of a naturally drought-stressed pine forest altered the metabolic potential of the soil microbiome and its ability to decompose lignocellulolytic compounds as a critical ecosystem function. Drought mitigation by a decade of irrigation stimulated profound changes in the functional capacity encoded in the soil microbiome, revealing alterations in carbon and nitrogen metabolism as well as regulatory processes protecting microorganisms from starvation and desiccation. Despite the structural and functional shifts from oligotrophic to copiotrophic microbial lifestyles under irrigation and the observation that different microbial taxa were involved in the degradation of cellulose and lignin as determined by a time-series stable-isotope probing incubation experiment with 13C-labeled substrates, degradation rates of these compounds were not affected by different water availabilities. These findings provide new insights into the impact of precipitation changes on the soil microbiome and associated ecosystem functioning in a drought-prone pine forest and will help to improve our understanding of alterations in biogeochemical cycling under a changing climate.
Affiliation(s)
- Martin Hartmann
  - Department of Environmental Systems Science, Sustainable Agroecosystems, Institute of Agricultural Sciences, ETH Zürich, Zürich, Switzerland
  - Forest Soils and Biogeochemistry, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
- Claude Herzog
  - Forest Soils and Biogeochemistry, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
  - Department of Environmental Systems Science, Grassland Sciences, Institute of Agricultural Sciences, ETH Zürich, Zürich, Switzerland
- Ivano Brunner
  - Forest Soils and Biogeochemistry, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
- Beat Stierli
  - Forest Soils and Biogeochemistry, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
- Folker Meyer
  - Data Science, Institute for AI in Medicine, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
  - Argonne National Laboratory, Argonne, IL, United States
  - Computation Institute, University of Chicago, Chicago, IL, United States
  - Department of Medicine, University of Chicago, Chicago, IL, United States
- Nina Buchmann
  - Department of Environmental Systems Science, Grassland Sciences, Institute of Agricultural Sciences, ETH Zürich, Zürich, Switzerland
- Beat Frey
  - Forest Soils and Biogeochemistry, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
3
Parker J, Dubin A, Schneider R, Wagner KS, Jentoft S, Böhne A, Bayer T, Roth O. Immunological tolerance in the evolution of male pregnancy. Mol Ecol 2023; 32:819-840. [PMID: 34951070 DOI: 10.1111/mec.16333] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 09/22/2021] [Revised: 12/12/2021] [Accepted: 12/14/2021] [Indexed: 11/29/2022]
Abstract
The unique male pregnancy in pipefishes and seahorses ranges from basic attachment (pouch-less species: Nerophinae) of maternal eggs to specialized internal gestation in pouched species (e.g. Syngnathus and Hippocampus) with many transitions in between. Due to this diversity, male pregnancy offers a unique platform for assessing physiological and molecular adaptations in pregnancy evolution. These insights will contribute to answering long-standing questions of why and how pregnancy evolved convergently in so many vertebrate systems. To understand the molecular congruencies and disparities in male pregnancy evolution, we compared transcriptome-wide differentially expressed genes in four syngnathid species, at four pregnancy stages (nonpregnant, early, late and parturition). Across all species and pregnancy forms, metabolic processes and immune dynamics defined pregnancy stages; pouched species in particular shared expression features akin to female pregnancy. The observed downregulation of adaptive immune genes in early-stage pregnancy and its reversed upregulation during late/parturition in pouched species, most notably in Hippocampus, combined with directionless expression in the pouch-less species, suggests immune modulation to be restricted to pouched species that evolved placenta-like systems. We propose that increased foeto-paternal intimacy in pouched syngnathids commands immune suppression processes in early gestation, and that the elevated immune response during parturition coincides with pouch opening and reduced progeny reliance. Immune response regulation in pouched species supports the recently described functional MHC II pathway loss as critical in male pregnancy evolution. The independent co-option of similar genes and pathways both in male and female pregnancy highlights immune modulation as crucial for the evolutionary establishment of pregnancy.
Affiliation(s)
- Jamie Parker
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
- Arseny Dubin
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
- Ralf Schneider
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
- Kim Sara Wagner
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
- Sissel Jentoft
  - Department of Biosciences, Centre for Ecological and Evolutionary Synthesis, University of Oslo, Oslo, Norway
- Astrid Böhne
  - Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, Bonn, Germany
- Till Bayer
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
- Olivia Roth
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
4
Pottier M, Castagnet S, Gravey F, Leduc G, Sévin C, Petry S, Giard JC, Le Hello S, Léon A. Antimicrobial Resistance and Genetic Diversity of Pseudomonas aeruginosa Strains Isolated from Equine and Other Veterinary Samples. Pathogens 2022; 12:64. [PMID: 36678412 PMCID: PMC9867525 DOI: 10.3390/pathogens12010064] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/26/2022] [Revised: 12/20/2022] [Accepted: 12/26/2022] [Indexed: 01/03/2023] Open
Abstract
Pseudomonas aeruginosa is one of the leading causes of healthcare-associated infections in humans. This bacterium is less represented in veterinary medicine, despite causing difficult-to-treat infections due to its capacity to acquire antimicrobial resistance, produce biofilms, and persist in the environment, along with the limited number of available veterinary antibiotic therapies. Here, we explored susceptibility profiles to antibiotics and to didecyldimethylammonium chloride (DDAC), a quaternary ammonium widely used as a disinfectant, in 168 P. aeruginosa strains isolated from animals, mainly Equidae. A genomic study was performed on 41 of these strains to determine their serotype, sequence type (ST), relatedness, and resistome. Overall, 7.7% of animal strains were resistant to carbapenems, 10.1% presented a multidrug-resistant (MDR) profile, and 11.3% showed decreased susceptibility (DS) to DDAC. Genomic analyses revealed that the study population was diverse, and 4.9% were ST235, which is considered the most relevant human high-risk clone worldwide. This study found P. aeruginosa populations with carbapenem resistance, multidrug resistance, and DS to DDAC in equine and canine isolates. These strains, which are not susceptible to antibiotics used in veterinary and human medicine, warrant the setting up of clone monitoring, based on that already in place in human medicine, in a one-health approach.
Affiliation(s)
- Marine Pottier
  - Research Department, LABÉO, 14053 Caen, France
  - Inserm UMR 1311, Dynamicure, Normandie University, UNICAEN, UNIROUEN, 14000 Caen, France
- Sophie Castagnet
  - Research Department, LABÉO, 14053 Caen, France
  - Inserm UMR 1311, Dynamicure, Normandie University, UNICAEN, UNIROUEN, 14000 Caen, France
- François Gravey
  - Inserm UMR 1311, Dynamicure, Normandie University, UNICAEN, UNIROUEN, 14000 Caen, France
  - CHU de Caen, Service de Microbiologie, Avenue de la Côte de Nacre, 14033 Caen, France
- Guillaume Leduc
  - CHU de Caen, Service de Microbiologie, Avenue de la Côte de Nacre, 14033 Caen, France
- Corinne Sévin
  - Anses, Normandy Laboratory for Animal Health, Physiopathology and Epidemiology of Equine Diseases Unit, 14430 Goustranville, France
- Sandrine Petry
  - Anses, Normandy Laboratory for Animal Health, Physiopathology and Epidemiology of Equine Diseases Unit, 14430 Goustranville, France
- Jean-Christophe Giard
  - Inserm UMR 1311, Dynamicure, Normandie University, UNICAEN, UNIROUEN, 14000 Caen, France
- Simon Le Hello
  - Inserm UMR 1311, Dynamicure, Normandie University, UNICAEN, UNIROUEN, 14000 Caen, France
  - CHU de Caen, Service de Microbiologie, Avenue de la Côte de Nacre, 14033 Caen, France
  - CHU de Caen, Service d’Hygiène Hospitalière, Avenue de la Côte de Nacre, 14033 Caen, France
- Albertine Léon
  - Research Department, LABÉO, 14053 Caen, France
  - Inserm UMR 1311, Dynamicure, Normandie University, UNICAEN, UNIROUEN, 14000 Caen, France
5
Lugli GA, Longhi G, Mancabelli L, Alessandri G, Tarracchini C, Fontana F, Turroni F, Milani C, van Sinderen D, Ventura M. Tap water as a natural vehicle for microorganisms shaping the human gut microbiome. Environ Microbiol 2022; 24:3912-3923. [PMID: 35355372 PMCID: PMC9790288 DOI: 10.1111/1462-2920.15988] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/23/2022] [Revised: 03/24/2022] [Accepted: 03/25/2022] [Indexed: 12/30/2022]
Abstract
Fresh potable water is an indispensable drink which humans consume daily in substantial amounts. Nonetheless, very little is known about the composition of the microbial community inhabiting drinking water or its impact on our gut microbiota. In the current study, an exhaustive shotgun metagenomics analysis of the tap water microbiome highlighted the high genetic biodiversity of the microbial communities residing in fresh water and the existence of a conserved core tap water microbiota largely represented by novel microbial species, representing microbial dark matter. Furthermore, genome reconstruction of this microbial dark matter from water samples unveiled homologous sequences present in the faecal microbiome of humans from various geographical locations. Accordingly, investigation of the faecal microbiota content of a subject who consumed tap water daily for 3 years provides proof for horizontal transmission and colonization of water bacteria in the human gut.
Affiliation(s)
- Gabriele Andrea Lugli
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Giulia Longhi
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - GenProbio Srl, Parma, Italy
- Leonardo Mancabelli
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Giulia Alessandri
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Chiara Tarracchini
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Federico Fontana
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - GenProbio Srl, Parma, Italy
- Francesca Turroni
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - Microbiome Research Hub, University of Parma, Parma, Italy
- Christian Milani
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - Microbiome Research Hub, University of Parma, Parma, Italy
- Douwe van Sinderen
  - APC Microbiome Institute and School of Microbiology, Bioscience Institute, National University of Ireland, Cork, Ireland
- Marco Ventura
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - Microbiome Research Hub, University of Parma, Parma, Italy
6
Parker J, Roth O. Comparative assessment of immunological tolerance in fish with natural immunodeficiency. Developmental and Comparative Immunology 2022; 132:104393. [PMID: 35276317 DOI: 10.1016/j.dci.2022.104393] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/13/2021] [Revised: 02/24/2022] [Accepted: 03/05/2022] [Indexed: 06/14/2023]
Abstract
Natural occurrences of immunodeficiency by definition should lead to compromised immune function. The major histocompatibility complexes (MHC) are key components of the vertebrate adaptive immune system, charged with mediating allorecognition and antigen presentation functions. To this end, the genomic loss of the MHC II pathway in Syngnathus pipefishes raises questions regarding their immunological vigilance and allorecognition capabilities. Utilising allograft and autograft fin-transplants, we compared the allorecognition immune responses of two pipefish species, with (Nerophis ophidion) and without (Syngnathus typhle) a functional MHC II. Transcriptome-wide assessments explored the immunological tolerance and potential compensatory measures occupying the role of the absent MHC II. Visual observations suggested a more acute rejection response in N. ophidion allografts compared with S. typhle allografts. Differentially expressed genes involved in innate immunity, angiogenesis and tissue recovery were identified among transplantees. The intriguing upregulation of the cytotoxic T-cell implicated gzma in S. typhle allografts, suggests a prominent MHC I related response, which may compensate for the MHC II and CD4 loss. MHC I related downregulation in N. ophidion autografts hints at an immunological tolerance related reaction. These findings may indicate alternative measures evolved to cope with the MHC II genomic loss enabling the maintenance of appropriate tolerance levels. This study provides intriguing insights into the immune and tissue recovery mechanisms associated with syngnathid transplantation, and can be a useful reference for future studies focusing on transplantation transcriptomics in non-model systems.
Affiliation(s)
- Jamie Parker
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, D-24105, Kiel, Germany
  - Marine Evolutionary Biology, Christian-Albrechts-University, D-24118, Kiel, Germany
- Olivia Roth
  - Marine Evolutionary Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel, D-24105, Kiel, Germany
  - Marine Evolutionary Biology, Christian-Albrechts-University, D-24118, Kiel, Germany
7
Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022] Open
Abstract
Motivation: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. Results: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read sampling scheme, which allows the extraction of a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. Availability and implementation: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Diego Santoro
  - Department of Information Engineering, University of Padova, Padova, 35131, Italy
- Leonardo Pellegrina
  - Department of Information Engineering, University of Padova, Padova, 35131, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, Padova, 35131, Italy
- Fabio Vandin
  - Department of Information Engineering, University of Padova, Padova, 35131, Italy
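The general idea described in the SPRISS abstract, estimating frequent k-mers from a random sample of reads rather than from the whole dataset, can be illustrated with a minimal Python sketch. This is not the SPRISS algorithm and carries none of its approximation guarantees; the function names, the uniform sampling scheme and the frequency definition (k-mer count divided by total k-mer positions in the sample) are assumptions made for illustration only.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def approx_frequent_kmers(reads, k, min_freq, sample_size, seed=0):
    """Estimate k-mers with relative frequency >= min_freq from a read sample.

    Hypothetical helper: draws a uniform sample of reads without replacement,
    counts k-mers in the sample only, and reports estimated frequencies.
    """
    rng = random.Random(seed)
    sample = rng.sample(reads, min(sample_size, len(reads)))
    counts = count_kmers(sample, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items() if c / total >= min_freq}


if __name__ == "__main__":
    toy_reads = ["ACGTACGT"] * 50 + ["TTTTTTTT"] * 50
    # Counting on 40 sampled reads instead of all 100.
    print(approx_frequent_kmers(toy_reads, k=4, min_freq=0.15, sample_size=40))
```

Any exact k-mer counter could replace `count_kmers` here; the sampling step is independent of how the counting is done, which is the property the abstract emphasizes.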
8
Palma F, Mangone I, Janowicz A, Moura A, Chiaverini A, Torresi M, Garofolo G, Criscuolo A, Brisse S, Di Pasquale A, Cammà C, Radomski N. In vitro and in silico parameters for precise cgMLST typing of Listeria monocytogenes. BMC Genomics 2022; 23:235. [PMID: 35346021 PMCID: PMC8961897 DOI: 10.1186/s12864-022-08437-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/06/2021] [Accepted: 02/28/2022] [Indexed: 02/02/2023] Open
Abstract
Background: Whole genome sequencing analyzed by core genome multi-locus sequence typing (cgMLST) is widely used in surveillance of the pathogenic bacterium Listeria monocytogenes. Given the heterogeneity of available bioinformatics tools to define cgMLST alleles, our aim was to identify parameters influencing the precision of cgMLST profiles. Methods: We used three L. monocytogenes reference genomes from different phylogenetic lineages and assessed the impact of in vitro (i.e. tested genomes, successive platings, replicates of DNA extraction and sequencing) and in silico parameters (i.e. targeted depth of coverage, depth of coverage, breadth of coverage, assembly metrics, cgMLST workflows, cgMLST completeness) on the precision of a cgMLST scheme of 1748 core loci. Six cgMLST workflows were tested, comprising assembly-based (BIGSdb, INNUENDO, GENPAT, SeqSphere and BioNumerics) and assembly-free (i.e. kmer-based MentaLiST) allele callers. Principal component analyses and generalized linear models were used to identify the most impactful parameters on cgMLST precision. Results: The isolate’s genetic background, cgMLST workflows, cgMLST completeness, as well as depth and breadth of coverage were the parameters that impacted most on cgMLST precision (i.e. identical alleles against reference circular genomes). All workflows performed well at ≥40X of depth of coverage, with high loci detection (> 99.54% for all, except for BioNumerics with 97.78%) and showed consistent cluster definitions using the reference cut-off of ≤7 allele differences. Conclusions: This highlights that bioinformatics workflows dedicated to cgMLST allele calling are largely robust when paired-end reads are of high quality and when the sequencing depth is ≥40X. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08437-4.
9
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform 2022; 23:6514404. [PMID: 35076693 PMCID: PMC8921630 DOI: 10.1093/bib/bbab563] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 09/09/2021] [Revised: 12/03/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022] Open
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.
Affiliation(s)
- Venket Raghavan (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
- Louis Kraft (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
10
Elworth RAL, Wang Q, Kota PK, Barberan CJ, Coleman B, Balaji A, Gupta G, Baraniuk RG, Shrivastava A, Treangen T. To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Res 2020; 48:5217-5234. [PMID: 32338745 PMCID: PMC7261164 DOI: 10.1093/nar/gkaa265] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/18/2019] [Revised: 03/20/2020] [Accepted: 04/04/2020] [Indexed: 02/01/2023] Open
Abstract
As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
Affiliation(s)
- Qi Wang
  - Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Houston, TX 77005, USA
- Pavan K Kota
  - Department of Bioengineering, Houston, TX 77005, USA
- C J Barberan
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Benjamin Coleman
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Advait Balaji
  - Department of Computer Science, Houston, TX 77005, USA
- Gaurav Gupta
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Richard G Baraniuk
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Anshumali Shrivastava
  - Department of Computer Science, Houston, TX 77005, USA
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Todd J Treangen
  - Department of Computer Science, Houston, TX 77005, USA
  - Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Houston, TX 77005, USA
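The abstract above names MinHash as a sketching algorithm used for sequence similarity calculations. As a rough illustration, the following Python sketch estimates the Jaccard similarity of two k-mer sets from fixed-size signatures. It is a textbook-style variant with one seeded hash per signature slot, not the implementation of any tool covered by the review; the function names and the choice of BLAKE2b as the hash are assumptions.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer under a given seed (illustrative choice of hash)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmer_set, num_hashes=128):
    """Keep, for each seeded hash function, the minimum hash over the set."""
    if not kmer_set:
        raise ValueError("cannot sketch an empty k-mer set")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree; estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = kmers("ACGTACGTGACCTGA", 4)
    b = kmers("ACGTACGTGACCTTT", 4)
    exact = len(a & b) / len(a | b)
    approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"exact={exact:.3f} estimated={approx:.3f}")
```

The signature length is fixed regardless of how many k-mers the input contains, which is why such sketches scale to very large datasets; the estimate becomes tighter as `num_hashes` grows.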
11
Abstract
Specialized de novo assemblers for diverse datatypes have been developed and are in widespread use for the analyses of single-cell genomics, metagenomics and RNA-seq data. However, assembly of large sequencing datasets produced by modern technologies is challenging and computationally intensive. In-silico read normalization has been suggested as a computational strategy to reduce redundancy in read datasets, which leads to significant speedups and memory savings of assembly pipelines. Previously, we presented a set multi-cover optimization based approach, ORNA, where reads are reduced without losing important k-mer connectivity information, as used in assembly graphs. Here we propose extensions to ORNA, named ORNA-Q and ORNA-K, which consider a weighted set multi-cover optimization formulation for the in-silico read normalization problem. These novel formulations make use of the base quality scores obtained from sequencers (ORNA-Q) or k-mer abundances of reads (ORNA-K) to improve normalization further. We devise efficient heuristic algorithms for solving both formulations. In applications to human RNA-seq data, ORNA-Q and ORNA-K are shown to assemble more or equally many full length transcripts compared to other normalization methods at similar or higher read reduction values. The algorithm is implemented under the latest version of ORNA (v2.0, https://github.com/SchulzLab/ORNA).
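The notion of in-silico read normalization described above, dropping reads whose k-mers are already sufficiently covered by the reads kept so far, can be sketched in a few lines of Python. This is a simplified greedy illustration and not ORNA, ORNA-Q or ORNA-K: it uses a single fixed coverage target for every k-mer and ignores base qualities and k-mer abundance weights. The function name and the `target` parameter are hypothetical.

```python
from collections import Counter


def normalize_reads(reads, k, target):
    """Greedy read reduction: keep a read only if it still adds needed coverage.

    A read is kept when at least one of its k-mers has been seen fewer than
    `target` times among the reads kept so far. Reads shorter than k contain
    no k-mers and are therefore dropped.
    """
    kept = []
    coverage = Counter()
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if any(coverage[km] < target for km in read_kmers):
            kept.append(read)
            coverage.update(read_kmers)
    return kept


if __name__ == "__main__":
    toy_reads = ["ACGTAC"] * 10 + ["GGGTTT"]
    reduced = normalize_reads(toy_reads, k=4, target=2)
    print(len(toy_reads), "->", len(reduced))
```

Every k-mer present in the input remains present in the reduced set, so redundancy from highly abundant sequences is removed while rare sequences, such as the single `GGGTTT` read in the toy data, are retained.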
12
Behdju M, Meyer U. DFG Priority Programme SPP 1736: Algorithms for Big Data. Künstliche Intelligenz 2018. [DOI: 10.1007/s13218-017-0518-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]