1
Lim PK, Wang R, Mutwil M. LSTrAP-denovo: Automated Generation of Transcriptome Atlases for Eukaryotic Species Without Genomes. Physiologia Plantarum 2024; 176:e14407. [PMID: 38973613 DOI: 10.1111/ppl.14407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Accepted: 05/28/2024] [Indexed: 07/09/2024]
Abstract
Despite the abundance of species with transcriptomic data, a significant number of species still lack sequenced genomes, making it difficult to study gene function and expression in these organisms. While de novo transcriptome assembly can be used to assemble protein-coding transcripts from RNA-sequencing (RNA-seq) data, the datasets used often only feature samples of arbitrarily selected or similar experimental conditions, which might fail to capture condition-specific transcripts. We developed the Large-Scale Transcriptome Assembly Pipeline for de novo assembled transcripts (LSTrAP-denovo) to automatically generate transcriptome atlases of eukaryotic species. Specifically, given an NCBI TaxID, LSTrAP-denovo can (1) filter undesirable RNA-seq accessions based on read data, (2) select RNA-seq accessions via unsupervised machine learning to construct a sample-balanced dataset for download, (3) assemble transcripts via over-assembly, (4) functionally annotate coding sequences (CDS) from assembled transcripts and (5) generate transcriptome atlases in the form of expression matrices for downstream transcriptomic analyses. LSTrAP-denovo is easy to implement, written in Python, and is freely available at https://github.com/pengkenlim/LSTrAP-denovo/.
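Step (2) of the abstract, building a sample-balanced dataset by unsupervised learning, can be illustrated with a minimal sketch. The abstract does not name the algorithm or the features used, so plain k-means over hypothetical per-accession profiles is an assumption made here for illustration; the accession IDs and the function names `kmeans` and `select_balanced` are likewise invented and are not part of LSTrAP-denovo.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means; returns (labels, centroids). Illustrative only."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def assign(cents):
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: math.dist(p, cents[c])) for p in points]

    for _ in range(iters):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return assign(centroids), centroids


def select_balanced(accessions, profiles, k, per_cluster):
    """Pick up to `per_cluster` accessions closest to each cluster centroid."""
    labels, centroids = kmeans(profiles, k)
    chosen = []
    for c in range(k):
        members = sorted(
            (math.dist(p, centroids[c]), acc)
            for acc, p, lab in zip(accessions, profiles, labels)
            if lab == c
        )
        chosen.extend(acc for _, acc in members[:per_cluster])
    return chosen


# Hypothetical accessions: five from one over-represented condition, two from another.
accessions = ["SRR_A1", "SRR_A2", "SRR_A3", "SRR_A4", "SRR_A5", "SRR_B1", "SRR_B2"]
profiles = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (10, 11)]
chosen = select_balanced(accessions, profiles, k=2, per_cluster=2)
```

Capping the number of accessions taken per cluster keeps an over-represented condition from dominating the download list, which is the balancing idea the abstract describes.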
Affiliation(s)
- Peng Ken Lim
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Ruoxi Wang
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
3
Delli-Ponti R, Shivhare D, Mutwil M. Using Gene Expression to Study Specialized Metabolism-A Practical Guide. Frontiers in Plant Science 2021; 11:625035. [PMID: 33510763 PMCID: PMC7835209 DOI: 10.3389/fpls.2020.625035] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/02/2020] [Accepted: 11/30/2020] [Indexed: 05/25/2023]
Abstract
Plants produce a vast array of chemical compounds that we use as medicines and flavors, but these compounds' biosynthetic pathways are still poorly understood. This paucity precludes us from modifying, improving, and mass-producing these specialized metabolites in suitable bioreactors. Many of the specialized metabolites are expressed in a narrow range of organs, tissues, and cell types, suggesting a tight regulation of the responsible biosynthetic pathways. Fortunately, with unprecedented ease of generating gene expression data and with >200,000 publicly available RNA sequencing samples, we are now able to study the expression of genes from hundreds of plant species. This review demonstrates how gene expression can elucidate the biosynthetic pathways by mining organ-specific genes, gene expression clusters, and applying various types of co-expression analyses. To empower biologists to perform these analyses, we showcase these analyses using recently published, user-friendly tools. Finally, we analyze the performance of co-expression networks and show that they are a valuable addition to elucidating multiple biosynthetic pathways of specialized metabolism.
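The idea of mining organ-specific genes can be made concrete with a small sketch. The score used here (mean expression in one organ divided by the sum of the per-organ means) is a simple specificity measure chosen for illustration; the review's own tools may compute specificity differently, and the gene names, sample names, and the 0.8 threshold are all hypothetical.

```python
from statistics import mean


def organ_specificity(expression, sample_organ):
    """Return {gene: {organ: score}}, where each gene's scores sum to 1.

    expression: {gene: {sample: expression value}}
    sample_organ: {sample: organ label}
    """
    scores = {}
    for gene, values in expression.items():
        by_organ = {}
        for sample, value in values.items():
            by_organ.setdefault(sample_organ[sample], []).append(value)
        organ_means = {organ: mean(vals) for organ, vals in by_organ.items()}
        total = sum(organ_means.values())
        scores[gene] = {
            organ: (m / total if total > 0 else 0.0) for organ, m in organ_means.items()
        }
    return scores


def organ_specific_genes(scores, organ, threshold=0.8):
    """Genes whose score in `organ` meets the (assumed) threshold."""
    return sorted(g for g, s in scores.items() if s.get(organ, 0.0) >= threshold)


# Hypothetical data: geneA is root-enriched, geneB is expressed evenly.
sample_organ = {"s1": "root", "s2": "root", "s3": "leaf", "s4": "leaf"}
expression = {
    "geneA": {"s1": 90.0, "s2": 110.0, "s3": 5.0, "s4": 5.0},
    "geneB": {"s1": 10.0, "s2": 10.0, "s3": 10.0, "s4": 10.0},
}
scores = organ_specificity(expression, sample_organ)
root_genes = organ_specific_genes(scores, "root")
```

A gene expressed almost exclusively in one organ scores close to 1 there, while a uniformly expressed gene scores 1 divided by the number of organs.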
Affiliation(s)
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
4
Lim JJJ, Koh J, Moo JR, Villanueva EMF, Putri DA, Lim YS, Seetoh WS, Mulupuri S, Ng JWZ, Nguyen NLU, Reji R, Foo H, Zhao MX, Chan TL, Rodrigues EE, Kairon RS, Hee KM, Chee NC, Low AD, Chen ZHX, Lim SC, Lunardi V, Fong TC, Chua CX, Koh KTS, Julca I, Delli-Ponti R, Ng JWX, Mutwil M. Fungi.guru: Comparative genomic and transcriptomic resource for the fungi kingdom. Comput Struct Biotechnol J 2020; 18:3788-3795. [PMID: 33304470 PMCID: PMC7718472 DOI: 10.1016/j.csbj.2020.11.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/05/2020] [Revised: 11/10/2020] [Accepted: 11/10/2020] [Indexed: 12/30/2022]
Abstract
The fungi kingdom is composed of eukaryotic heterotrophs, which are responsible for balancing the ecosystem and play a major role as decomposers. They also produce a vast diversity of secondary metabolites, which have antibiotic or pharmacological properties. However, our lack of knowledge of gene function in fungi precludes us from tailoring them to our needs and tapping into their metabolic diversity. To help remedy this, we gathered genomic and gene expression data of the 19 most widely researched fungi to build an online tool, fungi.guru, which contains tools for cross-species identification of conserved pathways, functional gene modules, and gene families. We exemplify how our tool can elucidate the molecular function, biological process and cellular component of genes involved in various biological processes, by identifying a secondary metabolite pathway producing gliotoxin in Aspergillus fumigatus, the catabolic pathway of cellulose in Coprinopsis cinerea and the conserved DNA replication pathway in Fusarium graminearum and Pyricularia oryzae. The tool is available at www.fungi.guru.
Affiliation(s)
- Jolyn Jia Jia Lim
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Jace Koh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Jia Rong Moo
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Dhira Anindya Putri
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Yuen Shan Lim
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Wei Song Seetoh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Sriya Mulupuri
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Janice Wan Zhen Ng
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Nhi Le Uyen Nguyen
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Rinta Reji
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Herman Foo
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Margaret Xuan Zhao
- College of Medicine and Veterinary Medicine, University of Edinburgh, Old College, South Bridge, Edinburgh EH8 9YL, United Kingdom
- Tong Ling Chan
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Edbert Edric Rodrigues
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Ryanjit Singh Kairon
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Ker Min Hee
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Natasha Cassandra Chee
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Ann Don Low
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Zoe Hui Xin Chen
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Shan Chun Lim
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Vanessa Lunardi
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Tuck Choy Fong
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Cherlyn Xin'Er Chua
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Kenny Ting Sween Koh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Irene Julca
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Riccardo Delli-Ponti
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Jonathan Wei Xiong Ng
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
5
Hew B, Tan QW, Goh W, Ng JWX, Mutwil M. LSTrAP-Crowd: prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data. BMC Biol 2020; 18:114. [PMID: 32883264 PMCID: PMC7470450 DOI: 10.1186/s12915-020-00846-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/21/2020] [Accepted: 08/12/2020] [Indexed: 12/12/2022]
Abstract
Background: Bacterial resistance to antibiotics is a growing health problem that is projected to cause more deaths than cancer by 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the structurally conserved bacterial ribosomes, factors involved in protein synthesis are thus prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. Here, we use a bioinformatics approach to identify novel components of protein synthesis. Results: In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. Conclusions: We identified genes related to protein synthesis in common bacterial pathogens and thus provide a resource of potential antibiotic development targets for experimental validation. The data can be used to explore additional vulnerabilities of bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowd-sourced.
Affiliation(s)
- Benedict Hew
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Qiao Wen Tan
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- William Goh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Jonathan Wei Xiong Ng
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
6
LSTrAP-Cloud: A User-Friendly Cloud Computing Pipeline to Infer Coexpression Networks. Genes (Basel) 2020; 11:genes11040428. [PMID: 32316247 PMCID: PMC7230309 DOI: 10.3390/genes11040428] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/11/2020] [Revised: 04/08/2020] [Accepted: 04/13/2020] [Indexed: 12/15/2022]
Abstract
As genomes become increasingly available, gene function prediction presents itself as one of the major hurdles in our quest to extract meaningful information on the biological processes genes participate in. In order to facilitate gene function prediction, we show how our user-friendly pipeline, the Large-Scale Transcriptomic Analysis Pipeline in Cloud (LSTrAP-Cloud), can help biologists make a shortlist of genes involved in a biological process of interest by using a single gene of interest as bait. LSTrAP-Cloud is based on Google Colaboratory and provides user-friendly tools that process and quality-control RNA sequencing data streamed from the European Nucleotide Archive. LSTrAP-Cloud outputs a gene coexpression network that can be used to identify functionally related genes for any organism with a sequenced genome and publicly available RNA sequencing data. Here, we used the biosynthesis pathway of Nicotiana tabacum as a case study to demonstrate how enzymes, transporters, and transcription factors involved in the synthesis, transport, and regulation of nicotine can be identified using our pipeline.
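The bait-gene approach described above can be sketched as ranking all other genes by how closely their expression profiles track the bait across samples. The sketch below uses the Pearson correlation coefficient as the similarity measure, which is an assumption made for illustration and not a statement about what LSTrAP-Cloud computes internally; the gene names, expression values, and the function `coexpressed_with` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def coexpressed_with(bait, expression, top_n=10):
    """Rank every other gene by its correlation with the bait gene's profile."""
    bait_profile = expression[bait]
    ranked = sorted(
        ((pearson(bait_profile, profile), gene)
         for gene, profile in expression.items() if gene != bait),
        reverse=True,
    )
    return [(gene, r) for r, gene in ranked[:top_n]]


# Hypothetical expression values across four samples.
expression = {
    "bait": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.0],   # rises with the bait
    "g2": [4.0, 3.0, 2.0, 1.0],   # falls as the bait rises
    "g3": [1.0, 1.0, 1.0, 1.0],   # constant, uninformative
}
shortlist = coexpressed_with("bait", expression, top_n=2)
```

The top of the ranked list is the shortlist of candidate genes; in practice a correlation cutoff or a rank-based measure could replace the fixed `top_n`.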
7
Giansanti V, Tang M, Cittaro D. Fast analysis of scATAC-seq data using a predefined set of genomic regions. F1000Res 2020; 9:199. [PMID: 32595951 PMCID: PMC7308914 DOI: 10.12688/f1000research.22731.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/13/2020] [Indexed: 02/11/2024]
Abstract
Background: Analysis of scATAC-seq data has recently been scaled to thousands of cells. While processing of other types of single-cell data was boosted by the implementation of alignment-free techniques, pipelines available to process scATAC-seq data still require large computational resources. We propose here an approach based on pseudoalignment, which reduces the execution times and hardware needs at little cost in precision. Methods: Public data for 10k PBMC were downloaded from the 10x Genomics website. Reads were aligned to various references derived from DNase I Hypersensitive Sites (DHS) using kallisto and quantified with bustools. We compared our results with the publicly available ones derived by cellranger-atac. Results: We found that kallisto does not introduce biases in quantification of known peaks and that cell groups are identified in a consistent way. We also found that cell identification is robust when analysis is performed using a DHS-derived reference in place of de novo identification of ATAC peaks. Lastly, we found that our approach is suitable for reliable quantification of gene activity based on scATAC-seq signal, and thus allows for efficient labelling of cell groups based on marker genes. Conclusions: Analysis of scATAC-seq data by means of kallisto produces results in line with standard pipelines while being considerably faster; using a set of known DHS sites as reference does not affect the ability to characterize the cell populations.
Affiliation(s)
- Valentina Giansanti
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Ming Tang
- FAS informatics, Harvard University, Cambridge, MA, USA
- Davide Cittaro
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
8
Abstract
Background: Analysis of scATAC-seq data has recently been scaled to thousands of cells. While processing of other types of single-cell data was boosted by the implementation of alignment-free techniques, pipelines available to process scATAC-seq data still require large computational resources. We propose here an approach based on pseudoalignment, which reduces the execution times and hardware needs at little cost in precision. Methods: Public data for 10k PBMC were downloaded from the 10x Genomics website. Reads were aligned to various references derived from DNase I Hypersensitive Sites (DHS) using kallisto and quantified with bustools. We compared our results with the publicly available ones derived by cellranger-atac. We subsequently tested our approach on scATAC-seq data for the K562 cell line. Results: We found that kallisto does not introduce biases in quantification of known peaks; the cell groups identified are consistent with the ones identified by the standard method. We also found that cell identification is robust when analysis is performed using a DHS-derived reference in place of de novo identification of ATAC peaks. Lastly, we found that our approach is suitable for reliable quantification of gene activity based on scATAC-seq signal, and thus allows for efficient labelling of cell groups based on marker genes. Conclusions: Analysis of scATAC-seq data by means of kallisto produces results in line with standard pipelines while being considerably faster; using a set of known DHS sites as reference does not affect the ability to characterize the cell populations.
Affiliation(s)
- Valentina Giansanti
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Ming Tang
- FAS informatics, Harvard University, Cambridge, MA, USA
- Davide Cittaro
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy