1
Lim PK, Wang R, Mutwil M. LSTrAP-denovo: Automated Generation of Transcriptome Atlases for Eukaryotic Species Without Genomes. Physiologia Plantarum 2024; 176:e14407. [PMID: 38973613 DOI: 10.1111/ppl.14407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Accepted: 05/28/2024] [Indexed: 07/09/2024]
Abstract
Despite the abundance of species with transcriptomic data, a significant number of species still lack sequenced genomes, making it difficult to study gene function and expression in these organisms. While de novo transcriptome assembly can be used to assemble protein-coding transcripts from RNA-sequencing (RNA-seq) data, the datasets used often only feature samples of arbitrarily selected or similar experimental conditions, which might fail to capture condition-specific transcripts. We developed the Large-Scale Transcriptome Assembly Pipeline for de novo assembled transcripts (LSTrAP-denovo) to automatically generate transcriptome atlases of eukaryotic species. Specifically, given an NCBI TaxID, LSTrAP-denovo can (1) filter undesirable RNA-seq accessions based on read data, (2) select RNA-seq accessions via unsupervised machine learning to construct a sample-balanced dataset for download, (3) assemble transcripts via over-assembly, (4) functionally annotate coding sequences (CDS) from assembled transcripts and (5) generate transcriptome atlases in the form of expression matrices for downstream transcriptomic analyses. LSTrAP-denovo is easy to implement, written in Python, and is freely available at https://github.com/pengkenlim/LSTrAP-denovo/.
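As a rough illustration of step (2), the sketch below shows one way a sample-balanced set of accessions could be drawn once the accessions have been grouped into clusters. The abstract does not say which clustering method or selection rule LSTrAP-denovo uses, so the round-robin rule, the function name `select_balanced`, and the accession IDs are all hypothetical; cluster labels are assumed to come from any unsupervised clustering step.

```python
from collections import defaultdict
from itertools import zip_longest


def select_balanced(accessions, cluster_labels, max_total):
    """Pick up to max_total accessions, cycling through clusters in turn.

    Hypothetical helper: this is an assumed selection rule, not the
    published LSTrAP-denovo algorithm.
    """
    # Group accessions by their cluster label, preserving input order.
    groups = defaultdict(list)
    for accession, label in zip(accessions, cluster_labels):
        groups[label].append(accession)

    picked = []
    # Each "round" takes at most one accession from every cluster,
    # so large clusters cannot crowd out small ones.
    ordered_groups = [groups[label] for label in sorted(groups)]
    for round_members in zip_longest(*ordered_groups):
        for accession in round_members:
            if accession is not None and len(picked) < max_total:
                picked.append(accession)
    return picked


# Made-up accession IDs: cluster 0 is over-represented.
accessions = ["A1", "A2", "A3", "A4", "B1", "C1"]
labels = [0, 0, 0, 0, 1, 2]
selection = select_balanced(accessions, labels, max_total=4)
print(selection)
```

With this rule, every cluster contributes one accession before any cluster contributes a second, which is one simple reading of "sample-balanced".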
Affiliation(s)
- Peng Ken Lim
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Ruoxi Wang
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Marek Mutwil
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
2
Fallon TR, Čalounová T, Mokrejš M, Weng JK, Pluskal T. transXpress: a Snakemake pipeline for streamlined de novo transcriptome assembly and annotation. BMC Bioinformatics 2023; 24:133. [PMID: 37016291 PMCID: PMC10074830 DOI: 10.1186/s12859-023-05254-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 03/24/2023] [Indexed: 04/06/2023] [Open access]
Abstract
BACKGROUND: RNA-seq followed by de novo transcriptome assembly has been a transformative technique in biological research of non-model organisms, but the computational processing of RNA-seq data entails many different software tools. The complexity of these de novo transcriptomics workflows therefore presents a major barrier for researchers to adopt best-practice methods and up-to-date versions of software.
RESULTS: Here we present a streamlined and universal de novo transcriptome assembly and annotation pipeline, transXpress, implemented in Snakemake. transXpress supports two popular assembly programs, Trinity and rnaSPAdes, and allows parallel execution on heterogeneous cluster computing hardware.
CONCLUSIONS: transXpress simplifies the use of best-practice methods and up-to-date software for de novo transcriptome assembly, and produces standardized output files that can be mined using SequenceServer to facilitate rapid discovery of new genes and proteins in non-model organisms.
Affiliation(s)
- Timothy R Fallon
  - Scripps Institution of Oceanography, UC San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
- Tereza Čalounová
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 2, 16000, Prague 6, Czech Republic
- Martin Mokrejš
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 2, 16000, Prague 6, Czech Republic
- Jing-Ke Weng
  - Whitehead Institute for Biomedical Research, 455 Main Street, Cambridge, MA, 02142, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Tomáš Pluskal
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 2, 16000, Prague 6, Czech Republic
3
Krinos AI, Cohen NR, Follows MJ, Alexander H. Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly. BMC Bioinformatics 2023; 24:74. [PMID: 36869298 PMCID: PMC9983209 DOI: 10.1186/s12859-022-05121-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Accepted: 12/21/2022] [Indexed: 03/05/2023] [Open access]
Abstract
BACKGROUND: Diverse communities of microbial eukaryotes in the global ocean provide a variety of essential ecosystem services, from primary production and carbon flow through trophic transfer to cooperation via symbioses. Increasingly, these communities are being understood through the lens of omics tools, which enable high-throughput processing of diverse communities. Metatranscriptomics offers an understanding of near real-time gene expression in microbial eukaryotic communities, providing a window into community metabolic activity.
RESULTS: Here we present a workflow for eukaryotic metatranscriptome assembly, and validate the ability of the pipeline to recapitulate real and manufactured eukaryotic community-level expression data. We also include an open-source tool for simulating environmental metatranscriptomes for testing and validation purposes. We reanalyze previously published metatranscriptomic datasets using our metatranscriptome analysis approach.
CONCLUSION: We determined that a multi-assembler approach improves eukaryotic metatranscriptome assembly based on recapitulated taxonomic and functional annotations from an in-silico mock community. The systematic validation of metatranscriptome assembly and annotation methods provided here is a necessary step to assess the fidelity of our community composition measurements and functional content assignments from eukaryotic metatranscriptomes.
Affiliation(s)
- Arianna I Krinos
  - MIT-WHOI Joint Program in Oceanography and Applied Ocean Science and Engineering, Cambridge and Woods Hole, MA, USA
  - Department of Biology, Woods Hole Oceanographic Institution, Woods Hole, MA, USA
  - Department of Earth, Atmospheric, and Planetary Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Natalie R Cohen
  - Skidaway Institute of Oceanography, University of Georgia, Savannah, GA, USA
- Michael J Follows
  - Department of Earth, Atmospheric, and Planetary Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Harriet Alexander
  - Department of Biology, Woods Hole Oceanographic Institution, Woods Hole, MA, USA
4
Salinas-Restrepo C, Misas E, Estrada-Gómez S, Quintana-Castillo JC, Guzman F, Calderón JC, Giraldo MA, Segura C. Improving the Annotation of the Venom Gland Transcriptome of Pamphobeteus verdolaga, Prospecting Novel Bioactive Peptides. Toxins (Basel) 2022; 14:408. [PMID: 35737069 PMCID: PMC9228390 DOI: 10.3390/toxins14060408] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/30/2022] [Revised: 06/06/2022] [Accepted: 06/07/2022] [Indexed: 02/01/2023] [Open access]
Abstract
Spider venoms constitute a trove of novel peptides of biotechnological interest. A paucity of next-generation sequencing (NGS) data has meant that less than 1% of these peptides have been described. Increasing evidence indicates that a single transcriptome assembler underestimates the number of genes that can be assembled. Here, the transcriptome of the venom gland of the spider Pamphobeteus verdolaga was re-assembled using three freely available assemblers, Trinity, SOAPdenovo-Trans, and SPAdes, to obtain a more complete annotation. Assembler performance was evaluated by contig number, N50, read representation in the assembly, and BUSCO retrieval against the arthropod dataset. Of all the sequences assembled by all software, 39.26% were common to the three assemblers, 27.88% were uniquely assembled by Trinity, and 27.65% were uniquely assembled by SPAdes. The non-redundant merging of the three assemblies' output permitted the annotation of 9232 sequences, 23% more than with each individual software and 28% more than the previous P. verdolaga annotation; moreover, 65 novel theraphotoxins could be described. When generating data for non-model organisms, as well as when searching for novel peptides of biotechnological interest, it is highly recommended to employ at least two different transcriptome assemblers.
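For readers unfamiliar with the N50 metric mentioned above, a minimal sketch of the standard definition follows: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a non-empty collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Illustrative lengths (total 32): the two longest contigs (8 + 8) reach half.
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))  # prints 8
```

Because N50 rewards long contigs regardless of correctness, it is usually read alongside the other metrics listed in the abstract rather than on its own.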
Affiliation(s)
- Cristian Salinas-Restrepo
  - Grupo Toxinología, Alternativas Terapéuticas y Alimentarias, Facultad de Ciencias Farmacéuticas y Alimentarias, Universidad de Antioquia, Medellín 050012, Colombia
- Elizabeth Misas
  - Corporación para Investigaciones Biológicas, Medellín 050012, Colombia
- Sebastian Estrada-Gómez
  - Grupo Toxinología, Alternativas Terapéuticas y Alimentarias, Facultad de Ciencias Farmacéuticas y Alimentarias, Universidad de Antioquia, Medellín 050012, Colombia
  - Centro de Investigación en Recursos Naturales y Sustentabilidad, Universidad Bernardo O’Higgins, Avenida Viel 1497, Santiago 7750000, Chile
- Fanny Guzman
  - Núcleo Biotecnología Curauma (NBC), Pontificia Universidad Católica de Valparaíso, Valparaíso 2374631, Chile
- Juan C. Calderón
  - Physiology and Biochemistry Research Group-PHYSIS, Faculty of Medicine, University of Antioquia, Medellín 050012, Colombia
- Marco A. Giraldo
  - Biophysics Group, Institute of Physics, University of Antioquia, Medellín 050012, Colombia
- Cesar Segura
  - Grupo Malaria, Facultad de Medicina, Universidad de Antioquia, Medellín 050012, Colombia
5
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform 2022; 23:6514404. [PMID: 35076693 PMCID: PMC8921630 DOI: 10.1093/bib/bbab563] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 09/09/2021] [Revised: 12/03/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022] [Open access]
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.
Affiliation(s)
- Venket Raghavan (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
- Louis Kraft (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany