26
Prjibelski AD, Puglia GD, Antipov D, Bushmanova E, Giordano D, Mikheenko A, Vitale D, Lapidus A. Extending rnaSPAdes functionality for hybrid transcriptome assembly. BMC Bioinformatics 2020; 21:302. [PMID: 32703149 PMCID: PMC7379828 DOI: 10.1186/s12859-020-03614-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 06/14/2020] [Accepted: 06/18/2020] [Indexed: 11/29/2022]
Abstract
BACKGROUND De novo RNA-Seq assembly is a powerful method for analysing transcriptomes when the reference genome is not available or poorly annotated. However, due to the short length of Illumina reads it is usually impossible to reconstruct complete sequences of complex genes and alternative isoforms. The recently emerged possibility of generating long RNA reads, such as those from PacBio and Oxford Nanopore platforms, may dramatically improve the assembly quality, and thus the subsequent analysis. While reference-based tools for analysing long RNA reads were recently developed, there is no established pipeline for de novo assembly of such data. RESULTS In this work we present a novel method that performs high-quality de novo transcriptome assemblies by combining the accuracy and reliability of short reads with exon structure information derived from long error-prone reads. The algorithm is designed by incorporating the existing hybridSPAdes approach into the rnaSPAdes pipeline and adapting it for transcriptomic data. CONCLUSION To evaluate the benefit of using long RNA reads we selected several datasets containing both Illumina and Iso-seq or Oxford Nanopore Technologies (ONT) reads. Using existing quality assessment software, we show that hybrid assemblies performed with rnaSPAdes contain more full-length genes and alternative isoforms compared to the case when only short-read data is used.
27
Catalogue of stage-specific transcripts in Ixodes ricinus and their potential functions during the tick life-cycle. Parasit Vectors 2020; 13:311. [PMID: 32546252 PMCID: PMC7296661 DOI: 10.1186/s13071-020-04173-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/07/2020] [Accepted: 06/05/2020] [Indexed: 12/15/2022]
Abstract
Background The castor bean tick Ixodes ricinus is an important vector of several clinically important diseases, whose prevalence increases with accelerating global climate changes. Characterization of a tick life-cycle is thus of great importance. However, researchers mainly focus on specific organs of fed life stages, while early development of this tick species is largely neglected. Methods In an attempt to better understand the life-cycle of this widespread arthropod parasite, we sequenced the transcriptomes of four life stages (egg, larva, nymph and adult female), including unfed and partially blood-fed individuals. To enable a more reliable identification of transcripts and their comparison in all five transcriptome libraries, we validated an improved-fit set of five I. ricinus-specific reference genes for internal standard normalization of our transcriptomes. Then, we mapped biological functions to transcripts identified in different life stages (clusters) to elucidate life stage-specific processes. Finally, we drew conclusions from the functional enrichment of these clusters specifically assigned to each transcriptome, also in the context of recently published transcriptomic studies in ticks. Results We found that reproduction-related transcripts are present in both fed nymphs and fed females, underlining the poorly documented importance of ovaries as moulting regulators in ticks. Additionally, we identified transposase transcripts in tick eggs suggesting elevated transposition during embryogenesis, co-activated with factors driving developmental regulation of gene expression. Our findings also highlight the importance of the regulation of energetic metabolism in tick eggs during embryonic development and glutamate metabolism in nymphs. Conclusions Our study presents novel insights into stage-specific transcriptomes of I. ricinus and extends the current knowledge of this medically important pathogen, especially in the early phases of its development.
28
Arce-Leal ÁP, Bautista R, Rodríguez-Negrete EA, Manzanilla-Ramírez MÁ, Velázquez-Monreal JJ, Méndez-Lozano J, Bejarano ER, Castillo AG, Claros MG, Leyva-López NE. De novo assembly and functional annotation of Citrus aurantifolia transcriptome from Candidatus Liberibacter asiaticus infected and non-infected trees. Data Brief 2020; 29:105198. [PMID: 32071978 PMCID: PMC7011030 DOI: 10.1016/j.dib.2020.105198] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/05/2019] [Revised: 01/07/2020] [Accepted: 01/20/2020] [Indexed: 12/03/2022]
Abstract
Mexican lime (Citrus aurantifolia) belongs to the Rutaceae family and is nowadays one of the major commercial citrus crops in different countries. In Mexico, Mexican lime production is impaired by Huanglongbing (HLB) disease, associated with the bacterium Candidatus Liberibacter asiaticus (CLas). To date, transcriptomic studies of the CLas-citrus interaction have been performed mainly in sweet citrus models at the symptomatic (late) stage, where pleiotropic responses could mask important pathogen-driven host modulation as well as host antibacterial responses. Additionally, well-assembled reference transcriptomes for acid limes, including C. aurantifolia, are not available. The development of improved transcriptomic resources for the CLas-citrus pathosystem, including both asymptomatic (early) and symptomatic (late) stages, could accelerate the understanding of the disease. Here, we provide the first transcriptomic analysis from healthy and HLB-infected C. aurantifolia leaves at both asymptomatic and symptomatic stages, using an RNA-seq approach on the Illumina NextSeq 500 platform. The assembled transcriptome was constructed using the predesigned workflow Transflow, and a total of 41,522 tentative transcripts (TTs) were obtained. These C. aurantifolia TTs were functionally annotated using the TAIR10 and UniProtKB databases. All raw reads were deposited in the NCBI SRA with accession numbers SRR10353556, SRR10353558, SRR10353560 and SRR10353562. Overall, this dataset adds valuable new transcriptomic tools for future breeding programs, will allow the design of novel diagnostic molecular markers, and will be an essential tool for studying the HLB disease.
29
Dataset of de novo assembly and functional annotation of the transcriptome of certain developmental stages of coconut rhinoceros beetle, Oryctes rhinoceros L. Data Brief 2020; 28:105036. [PMID: 31921949 PMCID: PMC6948120 DOI: 10.1016/j.dib.2019.105036] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/17/2019] [Accepted: 12/12/2019] [Indexed: 11/20/2022]
Abstract
The coconut rhinoceros beetle, Oryctes rhinoceros L. (Insecta: Coleoptera: Scarabaeidae: Dynastinae) is one of the world's most important endemic and incessant pests of coconut (particularly in India and Southeast Asia), causing an estimated 10% yield loss in the crop. Various management strategies formulated and implemented to control this pest include bioagents, insecticide sprays, liquid formulations, pheromone traps, and botanical formulations. Also, potential microbial bioagents, viz. Oryctes rhinoceros nudivirus (OrNV) and Metarhizium anisopliae, have been implemented as biological control agents, and this has led to a beneficial reduction of the pest population unless significant immigration occurs. To date, research and development activities are still ongoing for the successful management of the pest; yet advances in understanding at the molecular level have been limited because basic genomic information is lacking for this cosmopolitan pest. The transcriptome approach has proved extremely useful in finding potential genes for pest control. Transcriptome analysis aids in gaining insights into the transcriptional changes which occur during different developmental stages of an organism. We have performed RNA sequencing of several developmental stages of O. rhinoceros, viz. early instar larva, late instar larva, pupa, and adult, on an Illumina HiSeq™ 2500 platform. Due to the unavailability of an O. rhinoceros genome, the RNA-seq data generated were assembled de novo using Trinity and annotated following redundancy removal. A dataset of 87,451 transcripts, which resulted after redundancy removal, was annotated using the NCBI non-redundant (nr) protein and UniProt databases. The data furnished could be used by others working on the development of pest management strategies, especially the identification of molecular targets for effective pest control. This information allows a better understanding of O. rhinoceros biology, which would contribute to outlining a new generation of stage-specific, environmentally friendly pest management techniques.
30
Step-by-Step Bioinformatics Analysis of Schistosoma mansoni Long Non-coding RNA Sequences. Methods Mol Biol 2020; 2151:109-133. [PMID: 32452000 DOI: 10.1007/978-1-0716-0635-3_10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/20/2022]
Abstract
In the last few years, long non-coding RNAs (lncRNAs) have been widely studied in humans, and their relevance for physiological and pathological conditions has been demonstrated. In parasites, there are only a few works, such as in Plasmodium falciparum, where it was shown that an lncRNA regulates the expression of a gene associated with immune system evasion, also indicating the relevance of understanding the role of this class of RNAs in parasites. In Schistosoma mansoni, in the last 2 years, there were four published articles related to the annotation of lncRNAs in different life cycle stages using RNA-Seq libraries. In order to make this process of lncRNA identification and annotation more accessible to biologists with no bioinformatics training, considering the growing number of S. mansoni RNA-Seq libraries publicly available from different sources, such as ovary tissues from bi-sex and single-sex infections, and the potential of lncRNAs as therapeutic targets, we provide this step-by-step protocol of lncRNA identification and quantification. This guide includes the download of RNA-Seq libraries from a public database and reads processing and mapping against the genome, transcript reconstruction, novel lncRNA identification, transcripts expression level determination, and the identification of differentially expressed lncRNAs.
31
Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 2019; 20:278. [PMID: 31842956 PMCID: PMC6912988 DOI: 10.1186/s13059-019-1910-1] [Citation(s) in RCA: 695] [Impact Index Per Article: 139.0] [Received: 09/10/2019] [Accepted: 12/02/2019] [Indexed: 11/13/2022]
Abstract
RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.
32
Zhang M, Heikkinen L, Knott KE, Wong G. De novo transcriptome assembly of a facultative parasitic nematode Pelodera (syn. Rhabditis) strongyloides. Gene 2019; 710:30-38. [PMID: 31128222 DOI: 10.1016/j.gene.2019.05.041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/15/2018] [Revised: 04/01/2019] [Accepted: 05/21/2019] [Indexed: 01/06/2023]
Abstract
Pelodera strongyloides is a generally free-living, gonochoristic, facultatively parasitic nematode. The whole genomic sequence of P. strongyloides remains unknown, but 4 small subunit ribosomal RNA (ssrRNA) gene sequences are available. This project launched a de novo transcriptome assembly with 100 bp paired-end RNA-seq reads from normal, starved and wet-plate cultured animals. The Trinity assembly tool generated 104,634 transcript contigs with an N50 of 2195 bp and an average contig length of 1103 bp. Transcriptome BLASTX matching results against five nematodes (C. elegans, Strongyloides stercoralis, Necator americanus, Trichuris trichiura, and Pristionchus pacificus) were consistent with their evolutionary relationships. Sixteen genes were identified as homologous to key elements of the C. elegans RNA interference system, such as Dicer, Argonaute, RNA-dependent RNA polymerase and double-stranded RNA transport proteins. In starved samples, we observed up-regulation of cuticle-related genes and 3 dauer formation genes. Dauer morphology was captured with enlarged phasmid under light microscopy, and dauer and normal larvae counts in clumps had a Pearson's product-moment correlation of 0.805 with P-value = 0.0088. Our results demonstrate that P. strongyloides could be used for studying nematode-related human or pet parasitic diseases. The assembled transcriptome reported here may be useful for understanding the evolution of parasitism in Nematoda.
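The N50 and average contig length quoted in this abstract are standard assembly summary statistics. As a rough illustration of how such figures are derived from a list of contig lengths, here is a minimal Python sketch. The function names (`n50`, `mean_length`) and the example lengths are hypothetical and are not taken from the study or from Trinity's own reporting scripts.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean contig length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    return sum(lengths) / len(lengths)


# Illustrative contig lengths in bp (made-up values, not data from the paper).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
example_mean = mean_length(example_lengths)
```

Because the N50 is weighted by bases and the mean is weighted by contig count, an assembly with many short fragments can show an N50 well above its mean contig length, as in the figures reported above.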
33
Dataset of de novo assembly and functional annotation of the transcriptome of blueberry (Vaccinium spp.). Data Brief 2019; 25:104390. [PMID: 31497632 PMCID: PMC6718820 DOI: 10.1016/j.dib.2019.104390] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/16/2019] [Revised: 07/25/2019] [Accepted: 08/05/2019] [Indexed: 11/22/2022]
Abstract
Blueberry is an economically important berry crop. Both production and consumption of blueberries have increased sharply worldwide in recent years at least partly due to their known health benefits. The development of improved genomic resources for blueberry, such as a well-assembled genome and transcriptome, could accelerate breeding through genomic-assisted approaches. To enrich available transcriptome data and identify genes potentially involved in fruit quality, RNA sequencing was performed on fruit tissue from two northern-adapted hybrid blueberry breeding populations. RNA-seq was carried out using the Illumina HiSeqTM 2500 platform. Because of the absence of a reference-grade genome for blueberry, a transcriptome was de novo assembled from this RNA-seq data and other publicly available transcriptome data from blueberry downloaded from the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) using Trinity. After removing redundancy, this resulted in a dataset of 91,861 blueberry unigenes. This unigene dataset was functionally annotated using the NCBI-Nr protein database. All raw reads from the breeding populations were deposited in the NCBI SRA with accession numbers SRR6281886, SRR6281887, SRR6281888, and SRR6281889. The de novo transcriptome assembly was deposited at NCBI Transcriptome Shotgun Assembly (TSA) database with accession number GGAB00000000. These data will provide real expression evidence for the blueberry genome gene prediction and gene functional annotation and a reference transcriptome for future gene expression studies involving blueberry fruit.
34
Voshall A, Moriyama EN. Next-generation transcriptome assembly and analysis: Impact of ploidy. Methods 2019; 176:14-24. [PMID: 31176772 DOI: 10.1016/j.ymeth.2019.06.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 02/02/2019] [Revised: 05/30/2019] [Accepted: 06/01/2019] [Indexed: 10/26/2022]
Abstract
Whole genome duplications (WGD) occur widely in plants, but the effects of these events impact all branches of life. WGD events have major evolutionary impacts, often leading to major structural changes within the chromosomes and massive changes in gene expression that facilitate rapid speciation and gene diversification. Even for species that currently have diploid genomes, the impact of ancestral duplication events is still present in the genomes, especially in the context of highly similar gene families that are retained from WGD. However, the impact of polyploidy on various bioinformatics workflows has not been studied well. In this review, we give an overview of the biological significance of polyploidy in different organisms. We describe the impact of having polyploid transcriptomes on bioinformatics analyses, especially focusing on transcriptome assembly and transcript quantification. We discuss the benefits of using simulated benchmarking data when examining the performance of various methods. We also present an example strategy to generate simulated allopolyploid transcriptomes and RNAseq datasets and show how these benchmark datasets can be used to assess the performance of transcript assembly and quantification methods. Our benchmarking study shows that all transcriptome assembly methods are affected by having polyploid genomes. Quantification accuracy is also impacted by polyploidy depending on the method. These simulated datasets can be adapted for testing tasks such as read mapping, variant calling, and differential expression under biologically realistic conditions.
35
Liu J, Yu T, Mu Z, Li G. TransLiG: a de novo transcriptome assembler that uses line graph iteration. Genome Biol 2019; 20:81. [PMID: 31014374 PMCID: PMC6480747 DOI: 10.1186/s13059-019-1690-7] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 11/23/2018] [Accepted: 04/09/2019] [Indexed: 01/30/2023]
Abstract
We present TransLiG, a new de novo transcriptome assembler, which is able to integrate the sequence depth and paired-end information into the assembling procedure by phasing paths and iteratively constructing line graphs starting from splicing graphs. TransLiG is shown to be significantly superior to all the salient de novo assemblers in both accuracy and computing resources when tested on artificial and real RNA-seq data. TransLiG is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/ .
36
Gilbert DG. Genes of the pig, Sus scrofa, reconstructed with EvidentialGene. PeerJ 2019; 7:e6374. [PMID: 30723633 PMCID: PMC6361002 DOI: 10.7717/peerj.6374] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 09/10/2018] [Accepted: 12/29/2018] [Indexed: 01/19/2023]
Abstract
The pig is a well-studied model animal of biomedical and agricultural importance. Genes of this species, Sus scrofa, are known from experiments and predictions, and collected at the NCBI reference sequence database section. Gene reconstruction from transcribed gene evidence of RNA-seq now can accurately and completely reproduce the biological gene sets of animals and plants. Such a gene set for the pig is reported here, including human orthologs missing from current NCBI and Ensembl reference pig gene sets, additional alternate transcripts, and other improvements. Methodology for accurate and complete gene set reconstruction from RNA is used: the automated SRA2Genes pipeline of EvidentialGene project.
37
Experimental validation of in silico predicted RAD locus frequencies using genomic resources and short read data from a model marine mammal. BMC Genomics 2019; 20:72. [PMID: 30669975 PMCID: PMC6341687 DOI: 10.1186/s12864-019-5440-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/12/2018] [Accepted: 01/08/2019] [Indexed: 01/01/2023]
Abstract
BACKGROUND Restriction site-associated DNA sequencing (RADseq) has revolutionized the study of wild organisms by allowing cost-effective genotyping of thousands of loci. However, for species lacking reference genomes, it can be challenging to select the restriction enzyme that offers the best balance between the number of obtained RAD loci and depth of coverage, which is crucial for a successful outcome. To address this issue, PredRAD was recently developed, which uses probabilistic models to predict restriction site frequencies from a transcriptome assembly or other sequence resource based on either GC content or mono-, di- or trinucleotide composition. This program generates predictions that are broadly consistent with estimates of the true number of restriction sites obtained through in silico digestion of available reference genome assemblies. However, in practice the actual number of loci obtained could potentially differ as incomplete enzymatic digestion or patchy sequence coverage across the genome might lead to some loci not being represented in a RAD dataset, while erroneous assembly could potentially inflate the number of loci. To investigate this, we used genome and transcriptome assemblies together with RADseq data from the Antarctic fur seal (Arctocephalus gazella) to compare PredRAD predictions with empirical estimates of the number of loci obtained via in silico digestion and from de novo assemblies. RESULTS PredRAD yielded consistently higher predicted numbers of restriction sites for the transcriptome assembly relative to the genome assembly. The trinucleotide and dinucleotide models also predicted higher frequencies than the mononucleotide or GC content models. Overall, the dinucleotide and trinucleotide models applied to the transcriptome and the genome assemblies respectively generated predictions that were closest to the number of restriction sites estimated by in silico digestion. Furthermore, the number of de novo assembled RAD loci mapping to restriction sites was similar to the expectation based on in silico digestion. CONCLUSIONS Our study reveals generally high concordance between PredRAD predictions and empirical estimates of the number of RAD loci. This further supports the utility of PredRAD, while also suggesting that it may be feasible to sequence and assemble the majority of RAD loci present in an organism's genome.
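To make the idea of a GC-content model for restriction site frequency concrete, the following Python sketch shows one simplified way such a prediction can be computed. It is an illustration under stated assumptions, not PredRAD's actual implementation: it assumes bases are independent, that G and C each occur with probability GC/2 and A and T each with probability (1 - GC)/2, and that the recognition sequence contains only unambiguous bases. The function names are hypothetical.

```python
def site_probability(site, gc_content):
    """Probability that a random position starts the recognition sequence,
    assuming independent bases drawn according to the GC content."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    base_prob = {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": (1 - gc_content) / 2,
        "T": (1 - gc_content) / 2,
    }
    prob = 1.0
    for base in site.upper():
        prob *= base_prob[base]  # KeyError for ambiguous bases (assumption: ACGT only)
    return prob


def expected_sites(site, gc_content, sequence_length):
    """Expected number of recognition sites in a sequence of the given length."""
    positions = max(sequence_length - len(site) + 1, 0)
    return site_probability(site, gc_content) * positions


# Example: a 6 bp recognition sequence (GAATTC) at 50% GC content.
p_six_cutter = site_probability("GAATTC", 0.5)
```

At 50% GC every base has probability 0.25, so a 6 bp site is expected roughly once every 4,096 positions. The di- and trinucleotide models mentioned in the abstract condition on neighbouring bases instead of treating each base independently, which is why their predictions can differ from this simple model.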
38
Wang Y, Yang H, Zi C, Wang Z. Transcriptomic analysis of the red and green light responses in Columba livia domestica. 3 Biotech 2019; 9:20. [PMID: 30622858 DOI: 10.1007/s13205-018-1551-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/19/2018] [Accepted: 12/20/2018] [Indexed: 11/29/2022]
Abstract
In this study, 108 paired White King pigeons, randomly divided into three compartments, were exposed to green light, red light, and white light followed by 15 h of light exposure, for a 6-month period. Three female birds from each group were selected and ovarian stromal tissue was collected. Pigeon reproductive data were also recorded every day. We performed transcriptome assembly on several tissue samples using Illumina HiSeq 2000 and analyzed differentially expressed genes involved in follicle development mechanisms. Reproductive data confirmed that exposure to red and green lights improved pigeon reproduction. In total, approximately 158,080 unigenes with an average length of 753 bp were obtained using the Trinity program. Gene Ontology, Clusters of Orthologous Groups, and the Kyoto Encyclopedia of Genes and Genomes were used to annotate and classify these unigenes. Large numbers of differentially expressed genes were discovered through pairwise comparisons between groups treated with monochromatic light versus white light. Some of these genes are associated with steroid hormone biosynthesis, cell cycle and circadian rhythm. Furthermore, qRT-PCR was used to detect the relative expression levels of randomly selected genes. A total of 17,419 potential simple sequence repeats were also identified. Our study provides insights into potential molecular mechanisms and genes that regulate pigeon reproduction in response to monochromatic light exposure. Our results and data will facilitate further investigation into the molecular mechanisms behind the effects of red and green lights on follicle development and reproduction in the pigeon.
39
Pertea M, Shumate A, Pertea G, Varabyou A, Breitwieser FP, Chang YC, Madugundu AK, Pandey A, Salzberg SL. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 2018; 19:208. [PMID: 30486838 PMCID: PMC6260756 DOI: 10.1186/s13059-018-1590-2] [Citation(s) in RCA: 162] [Impact Index Per Article: 27.0] [Received: 06/02/2018] [Accepted: 11/16/2018] [Indexed: 01/06/2023]
Abstract
We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess .
40
Chiu R, Nip KM, Chu J, Birol I. TAP: a targeted clinical genomics pipeline for detecting transcript variants using RNA-seq data. BMC Med Genomics 2018; 11:79. [PMID: 30200994 PMCID: PMC6131862 DOI: 10.1186/s12920-018-0402-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 01/24/2018] [Accepted: 08/31/2018] [Indexed: 12/22/2022]
Abstract
BACKGROUND RNA-seq is a powerful and cost-effective technology for molecular diagnostics of cancer and other diseases, and it can reach its full potential when coupled with validated clinical-grade informatics tools. Despite recent advances in long-read sequencing, transcriptome assembly of short reads remains a useful and cost-effective methodology for unveiling transcript-level rearrangements and novel isoforms. One of the major concerns for adopting the proven de novo assembly approach for RNA-seq data in clinical settings has been the analysis turnaround time. To address this concern, we have developed a targeted approach to expedite assembly and analysis of RNA-seq data. RESULTS Here we present our Targeted Assembly Pipeline (TAP), which consists of four stages: 1) alignment-free gene-level classification of RNA-seq reads using BioBloomTools, 2) de novo assembly of individual targets using Trans-ABySS, 3) alignment of assembled contigs to the reference genome and transcriptome with GMAP and BWA, and 4) structural and splicing variant detection using PAVFinder. We show that PAVFinder is a robust gene fusion detection tool when compared to established methods such as TopHat-Fusion and deFuse on simulated data of 448 events. Using the Leucegene acute myeloid leukemia (AML) RNA-seq data and a set of 580 COSMIC target genes, TAP identified a wide range of hallmark molecular anomalies including gene fusions, tandem duplications, insertions and deletions in agreement with published literature results. Moreover, also in this dataset, TAP captured AML-specific splicing variants such as skipped exons and novel splice sites reported in studies elsewhere. Running time of TAP on 100-150 million read pairs and a 580-gene set is 1 to 2 hours on a 48-core machine. CONCLUSIONS We demonstrated that TAP is a fast and robust RNA-seq variant detection pipeline that is potentially amenable to clinical applications. TAP is available at http://www.bcgsc.ca/platform/bioinfo/software/pavfinder.
41
Visser EA, Wegrzyn JL, Myburg AA, Naidoo S. Defence transcriptome assembly and pathogenesis related gene family analysis in Pinus tecunumanii (low elevation). BMC Genomics 2018; 19:632. [PMID: 30139335 PMCID: PMC6108113 DOI: 10.1186/s12864-018-5015-0] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 01/03/2018] [Accepted: 08/14/2018] [Indexed: 12/23/2022]
Abstract
BACKGROUND Fusarium circinatum is a pressing threat to the cultivation of many economically important pine tree species. Efforts to develop effective disease management strategies can be aided by investigating the molecular mechanisms involved in the host-pathogen interaction between F. circinatum and pine species. Pinus tecunumanii and Pinus patula are two closely related tropical pine species that differ widely in their resistance to F. circinatum challenge, being resistant and susceptible respectively, providing the potential for a useful pathosystem to investigate the molecular responses underlying resistance to F. circinatum. However, no genomic resources are available for P. tecunumanii. Pathogenesis-related proteins are classes of proteins that play important roles in plant-microbe interactions, e.g. chitinases, proteins that break down the major structural component of fungal cell walls. Generating a reference sequence for P. tecunumanii and characterizing pathogenesis-related gene families in these two pine species is an important step towards unravelling the pine-F. circinatum interaction. RESULTS Eight reference-based and 12 de novo assembled transcriptomes were produced for juvenile shoot tissue from both species. EvidentialGene pipeline redundancy reduction, expression filtering, protein clustering and taxonomic filtering produced a 50 Mb shoot transcriptome consisting of 28,621 contigs for P. tecunumanii and a 72 Mb shoot transcriptome consisting of 52,735 contigs for P. patula. Predicted protein sequences encoded by the assembled transcriptomes were clustered with reference proteomes from 92 other species to identify pathogenesis-related gene families in P. patula, P. tecunumanii and other pine species. CONCLUSIONS The P. tecunumanii transcriptome is the first gene catalogue for the species, representing an important resource for studying resistance to the pitch canker pathogen, F. circinatum. This study also constitutes, to our knowledge, the largest index of gymnosperm PR-genes to date.
|
42
|
Hacking J, Bertozzi T, Moussalli A, Bradford T, Gardner M. Characterisation of major histocompatibility complex class I transcripts in an Australian dragon lizard. Developmental and Comparative Immunology 2018; 84:164-171. [PMID: 29454831 DOI: 10.1016/j.dci.2018.02.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/15/2017] [Revised: 02/10/2018] [Accepted: 02/10/2018] [Indexed: 06/08/2023]
Abstract
Characterisation of squamate major histocompatibility complex (MHC) genes has lagged behind other taxonomic groups. MHC genes encode cell-surface glycoproteins that present self- and pathogen-derived peptides to T cells and play a critical role in pathogen recognition. Here we characterise MHC class I transcripts for an agamid lizard (Ctenophorus decresii) and investigate the evolution of MHC class I in Iguanian lizards. An iterative assembly strategy was used to identify six full-length C. decresii MHC class I transcripts, which were validated as likely to encode classical class I MHC molecules. Evidence for exon shuffling recombination was uncovered for C. decresii transcripts and Bayesian phylogenetic analysis of Iguanian MHC class I sequences revealed a pattern expected under a birth-and-death mode of evolution. This work provides a stepping stone towards further research on the agamid MHC class I region.
|
43
|
De novo transcriptome assembly and identification of salt-responsive genes in sugar beet M14. Comput Biol Chem 2018; 75:1-10. [PMID: 29705503 DOI: 10.1016/j.compbiolchem.2018.04.014] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 06/25/2017] [Revised: 01/06/2018] [Accepted: 04/21/2018] [Indexed: 11/21/2022]
Abstract
Sugar beet (Beta vulgaris) is an important sugar crop worldwide. Previous studies reported that the sugar beet monosomic addition line M14, obtained from an intercross between Beta vulgaris L. (cultivated species) and B. corolliflora Zoss (wild species), tolerates salt stress (up to 0.5 M NaCl). Identifying a broad spectrum of genes involved in M14 salt tolerance will help elucidate the molecular mechanisms underlying the salt-stress response. Comparative transcriptomics was performed to monitor genes differentially expressed in the leaf and root samples of sugar beet M14 seedlings treated with 0, 200 and 400 mM NaCl. Digital gene expression revealed that 3856 unigenes in leaves and 7157 unigenes in roots were differentially expressed under salt stress. Enrichment analysis of the differentially expressed genes based on the GO and KEGG databases showed that, in both leaves and roots, genes related to regulation of redox balance, signal transduction, and protein phosphorylation were differentially expressed. Comparison of gene expression in the leaf and root samples treated with 200 and 400 mM NaCl revealed different mechanisms for coping with salt stress. In addition, the expression levels of nine unigenes in the reactive oxygen species (ROS) scavenging system differed significantly between leaves and roots. Our transcriptomics results provide new insights into the salt-stress responses in the leaves and roots of sugar beet.
|
44
|
Buisine N, Kerdivel G, Sachs LM. De Novo Transcriptomic Approach to Study Thyroid Hormone Receptor Action in Non-mammalian Models. Methods Mol Biol 2018; 1801:265-285. [PMID: 29892831 DOI: 10.1007/978-1-4939-7902-8_21] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
Thyroid hormones are pleiotropic hormones involved in chordate physiology. Understanding their functions and mechanisms is also instrumental in diagnosing dysregulation and gaining predictive power that can be applied to medicine, ecology, etc. Today, high-throughput sequencing technologies offer the opportunity to address this issue not only in model organisms but also in non-model organisms. Here, we describe a method that uses RNA-seq for differential expression analysis in non-model organisms.
|
45
|
Abstract
Proper control of microRNA (miRNA) expression is critical for normal development and physiology, while abnormal miRNA expression is a common feature of many diseases. Dissecting mechanisms of miRNA regulation, however, is complicated by the generally poor annotation of miRNA primary transcripts (pri-miRNAs). Although some miRNAs are processed from well-defined protein coding genes, the majority of pri-miRNAs are poorly characterized noncoding RNAs, with incomplete annotation of promoters, splice sites, and polyadenylation signals. Due to the efficiency of DROSHA processing, the abundance of pri-miRNAs is very low at steady state, thereby complicating the elucidation of pri-miRNA structures. Here we describe a strategy to enrich intact pri-miRNAs and improve their coverage in RNA sequencing (RNA-seq) experiments. In addition, we outline a computational approach for reconstruction of pri-miRNA structures. This pipeline begins with raw RNA-seq reads and concludes with publication-ready visualization of pri-miRNA annotations. Together, these approaches allow the user to define and explore miRNA gene structures in a cell-type or organism of interest.
|
46
|
Kollmar M, Simm D. Identifying Sequenced Eukaryotic Genomes and Transcriptomes with diArk. Methods Mol Biol 2018; 1757:1-19. [PMID: 29761453 DOI: 10.1007/978-1-4939-7737-6_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The diArk Eukaryotic Genome Database is a manually curated and updated repository of available eukaryotic genome and transcriptome assemblies. diArk is a key resource for researchers interested in comparative eukaryotic genomics, and the entry point for browsing sequenced eukaryotes in general and for finding the species most closely related to one's own organism of interest in particular. The exponentially increasing number of sequenced species demands sophisticated search and data presentation tools. In this chapter we describe how to navigate the diArk database, keeping a first-time user in mind.
|
47
|
Stavrianakou M, Perez R, Wu C, Sachs MS, Aramayo R, Harlow M. Draft de novo transcriptome assembly and proteome characterization of the electric lobe of Tetronarce californica: a molecular tool for the study of cholinergic neurotransmission in the electric organ. BMC Genomics 2017; 18:611. [PMID: 28806931 PMCID: PMC5557070 DOI: 10.1186/s12864-017-3890-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/24/2016] [Accepted: 06/21/2017] [Indexed: 11/10/2022] Open
Abstract
Background The electric organ of Tetronarce californica (an electric ray formerly known as Torpedo californica) is a classic preparation for biochemical studies of cholinergic neurotransmission. To broaden the usefulness of this preparation, we have performed a transcriptome assembly of the presynaptic component of the electric organ (the electric lobe). We combined our assembled transcriptome with a previous transcriptome of the postsynaptic electric organ to define a MetaProteome containing pre- and post-synaptic components of the electric organ. Results Sequencing yielded 102 million paired-end 100 bp reads. De novo Trinity assembly was performed at k-mer size 25 (default) and at k-mer sizes 27, 29, and 31. Trinity generated around 103,000 transcripts and 78,000 genes per assembly. Assemblies were evaluated based on the number of bases/transcripts assembled, RSEM-EVAL scores, and informational content and completeness. We found that different assemblies scored differently according to the evaluation criteria used, and that while each individual assembly contained unique information, much of the assembly information was shared by all assemblies. To generate the presynaptic transcriptome (electric lobe) while capturing all information, assemblies were first clustered and then combined with postsynaptic transcripts (electric organ) downloaded from NCBI. The completeness of the resulting clustered predicted MetaProteome was rigorously evaluated by comparing its information against the predicted proteomes from Homo sapiens, Callorhinchus milli, and the Transporter Classification Database (TCDB). Conclusions In summary, we obtained a MetaProteome containing 92%, 88.5%, and 66% of the expected set of ultra-conserved sequences (i.e., BUSCOs) expected to be found for Eukaryotes, Metazoa, and Vertebrata, respectively. We cross-annotated the conserved set of proteins shared between the T. californica MetaProteome and the proteomes of H. sapiens and C. milli, using the H. sapiens genome as a reference. This information was used to predict the position in human pathways of the conserved members of the T. californica MetaProteome. We found proteins not detected before in T. californica, corresponding to processes involved in synaptic vesicle biology. Finally, we identified 42 transporter proteins in TCDB that were detected by the T. californica MetaProteome (electric fish) and not selected by a control proteome consisting of the combined proteomes of 12 widely diverse non-electric fishes by Reverse-Blast-Hit Blast. Combined, the information provided here is not only a unique tool for the study of cholinergic neurotransmission, but it is also a starting point for understanding the evolution of early vertebrates.
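The abstract above reports Trinity assemblies at several k-mer sizes. As a rough illustration of what the k-mer size controls, the following Python sketch splits a read into overlapping k-mers and counts them. It is not Trinity's implementation; the function name `count_kmers` and the toy read are assumptions made for this example only.

```python
from collections import Counter


def count_kmers(read, k):
    """Count every overlapping substring of length k in a read."""
    # A read of length n yields n - k + 1 k-mers (none if k > n).
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


# Hypothetical toy read; real reads in the study were 100 bp.
toy_read = "ACGTACGTAC"
for k in (3, 5):
    counts = count_kmers(toy_read, k)
    print(k, len(counts), sum(counts.values()))
```

A larger k yields fewer k-mers per read, and each one is more specific, which is one reason assemblies built at different k-mer sizes can recover somewhat different transcripts.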
|
48
|
Sze SH, Pimsler ML, Tomberlin JK, Jones CD, Tarone AM. A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms. BMC Genomics 2017; 18:387. [PMID: 28589866 PMCID: PMC5461550 DOI: 10.1186/s12864-017-3735-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022] Open
Abstract
Background With increased availability of de novo assembly algorithms, it is feasible to study entire transcriptomes of non-model organisms. While algorithms are available that are specifically designed for performing transcriptome assembly from high-throughput sequencing data, they are very memory-intensive, limiting their applications to small data sets with few libraries. Results We develop a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while utilizing as many RNA-Seq libraries as possible that contain hundreds of gigabases of data. New techniques are developed so that computations can be performed on a computing cluster with a moderate amount of physical memory. Conclusions Our strategy minimizes memory consumption while simultaneously obtaining comparable or improved accuracy over existing algorithms. It provides support for incremental updates of assemblies when new libraries become available.
|
49
|
Hoang NV, Furtado A, Mason PJ, Marquardt A, Kasirajan L, Thirugnanasambandam PP, Botha FC, Henry RJ. A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 2017; 18:395. [PMID: 28532419 PMCID: PMC5440902 DOI: 10.1186/s12864-017-3757-8] [Citation(s) in RCA: 131] [Impact Index Per Article: 18.7] [Received: 10/04/2016] [Accepted: 05/03/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Despite the economic importance of sugarcane in sugar and bioenergy production, there is not yet a reference genome available. Most of the sugarcane transcriptomic studies have been based on Saccharum officinarum gene indices (SoGI), expressed sequence tags (ESTs) and de novo assembled transcript contigs from short-reads; hence knowledge of the sugarcane transcriptome is limited in relation to transcript length and number of transcript isoforms. RESULTS The sugarcane transcriptome was sequenced using PacBio isoform sequencing (Iso-Seq) of a pooled RNA sample derived from leaf, internode and root tissues, of different developmental stages, from 22 varieties, to explore the potential for capturing full-length transcript isoforms. A total of 107,598 unique transcript isoforms were obtained, representing about 71% of the total number of predicted sugarcane genes. The majority of this dataset (92%) matched the plant protein database, while just over 2% was novel transcripts, and over 2% was putative long non-coding RNAs. About 56% and 23% of total sequences were annotated against the gene ontology and KEGG pathway databases, respectively. Comparison with de novo contigs from Illumina RNA-Sequencing (RNA-Seq) of the internode samples from the same experiment and public databases showed that the Iso-Seq method recovered more full-length transcript isoforms, had a higher N50 and average length of largest 1,000 proteins; whereas a greater representation of the gene content and RNA diversity was captured in RNA-Seq. Only 62% of PacBio transcript isoforms matched 67% of de novo contigs, while the non-matched proportions were attributed to the inclusion of leaf/root tissues and the normalization in PacBio, and the representation of more gene content and RNA classes in the de novo assembly, respectively. 
About 69% of PacBio transcript isoforms and 41% of de novo contigs aligned with the sorghum genome, indicating the high conservation of orthologs in the genic regions of the two genomes. CONCLUSIONS The transcriptome dataset should contribute to improved sugarcane gene models and sugarcane protein predictions; and will serve as a reference database for analysis of transcript expression in sugarcane.
|
50
|
Cribbin KM, Quackenbush CR, Taylor K, Arias-Rodriguez L, Kelley JL. Sex-specific differences in transcriptome profiles of brain and muscle tissue of the tropical gar. BMC Genomics 2017; 18:283. [PMID: 28388875 PMCID: PMC5383948 DOI: 10.1186/s12864-017-3652-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 06/29/2016] [Accepted: 03/22/2017] [Indexed: 02/06/2023] Open
Abstract
Background The tropical gar (Atractosteus tropicus) is the southernmost species of the seven extant species of gar fishes in the world. In Mexico and Central America, the species is an important food source due to its nutritional quality and low price. Despite its regional importance and increasing concerns about overexploitation and habitat degradation, basic genetic information on the tropical gar is lacking. Determining genetic information on the tropical gar is important for the sustainable management of wild populations, implementation of best practices in aquaculture settings, evolutionary studies of ancient lineages, and an understanding of sex-specific gene expression. In this study, the transcriptome of the tropical gar was sequenced and assembled de novo using tissues from three males and three females using Illumina sequencing technology. Sex-specific and highly differentially expressed transcripts in brain and muscle tissues between adult males and females were subsequently identified. Results The transcriptome was assembled de novo resulting in 80,611 transcripts with a contig N50 of 3,355 base pairs and over 168 kilobases in total length. Male muscle, brain, and gonad as well as female muscle and brain were included in the assembly. The assembled transcriptome was annotated to identify the putative function of expressed transcripts using Trinotate and SwissProt, a database of well-annotated proteins. The brain and muscle datasets were then aligned to the assembled transcriptome to identify transcripts that were differentially expressed between males and females. The contrast between male and female brain identified 109 transcripts from 106 genes that were significantly differentially expressed. In the muscle comparison, 82 transcripts from 80 genes were identified with evidence for significant differential expression. Almost all genes identified as differentially expressed were sex-specific. 
The differentially expressed transcripts were enriched for genes involved in cellular functioning, signaling, immune response, and tissue-specific functions. Conclusions This study identified differentially expressed transcripts between male and female gar in muscle and brain tissue. The majority of differentially expressed transcripts had sex-specific expression. Expanding on these findings to other developmental stages, populations, and species may lead to the identification of genetic factors contributing to the skewed sex ratio seen in the tropical gar and of sex-specific differences in expression in other species. Finally, the transcriptome assembly will open future research avenues on tropical gar development, cell function, environmental resistance, and evolution in the context of other early vertebrates.
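The abstract above summarises assembly contiguity with a contig N50. For readers unfamiliar with the statistic, here is a minimal Python sketch of the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length. The function name `n50` and the toy lengths are assumptions for illustration, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that takes the running sum to half the total is the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in base pairs.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it says nothing about whether the contigs are correct, which is why studies such as this one pair it with annotation or completeness checks.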
|