1
Jackson DJ, Cerveau N, Posnien N. De novo assembly of transcriptomes and differential gene expression analysis using short-read data from emerging model organisms - a brief guide. Front Zool 2024; 21:17. [PMID: 38902827 PMCID: PMC11188175 DOI: 10.1186/s12983-024-00538-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 06/12/2024] [Indexed: 06/22/2024] Open
Abstract
Many questions in biology benefit greatly from the use of a variety of model systems. High-throughput sequencing methods have been a triumph in the democratization of diverse model systems. They allow for the economical sequencing of an entire genome or transcriptome of interest, and with technical variations can even provide insight into genome organization and the expression and regulation of genes. The analysis and biological interpretation of such large datasets can present significant challenges that depend on the 'scientific status' of the model system. While high-quality genome and transcriptome references are readily available for well-established model systems, the establishment of such references for an emerging model system often requires extensive resources such as finances, expertise and computation capabilities. The de novo assembly of a transcriptome represents an excellent entry point for genetic and molecular studies in emerging model systems as it can efficiently assess gene content while also serving as a reference for differential gene expression studies. However, the process of de novo transcriptome assembly is non-trivial, and as a rule must be empirically optimized for every dataset. For the researcher working with an emerging model system, and with little to no experience with assembling and quantifying short-read data from the Illumina platform, these processes can be daunting. In this guide we outline the major challenges faced when establishing a reference transcriptome de novo and we provide advice on how to approach such an endeavor. We describe the major experimental and bioinformatic steps, provide some broad recommendations and cautions for the newcomer to de novo transcriptome assembly and differential gene expression analyses. Moreover, we provide an initial selection of tools that can assist in the journey from raw short-read data to assembled transcriptome and lists of differentially expressed genes.
Affiliation(s)
- Daniel J Jackson
  - University of Göttingen, Department of Geobiology, Goldschmidtstr.3, Göttingen, 37077, Germany
- Nicolas Cerveau
  - University of Göttingen, Department of Geobiology, Goldschmidtstr.3, Göttingen, 37077, Germany
- Nico Posnien
  - University of Göttingen, Department of Developmental Biology, GZMB, Justus-Von-Liebig-Weg 11, Göttingen, 37077, Germany
2
Moraga C, Sanchez E, Ferrarini MG, Gutierrez RA, Vidal EA, Sagot MF. BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data. Gigascience 2022; 11:6773084. [PMID: 36283679 PMCID: PMC9596168 DOI: 10.1093/gigascience/giac093] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/27/2020] [Revised: 11/08/2021] [Accepted: 09/15/2022] [Indexed: 11/04/2022] Open
Abstract
MicroRNAs (miRNAs) are small noncoding RNAs that are key players in the regulation of gene expression. In the past decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, most of which rely on preexisting reference genomes. However, when a reference genome is absent or is not of high quality, such identification becomes more difficult. In this context, we developed BrumiR, an algorithm that is able to discover miRNAs directly and exclusively from small RNA (sRNA) sequencing (sRNA-seq) data. We benchmarked BrumiR with datasets encompassing animal and plant species using real and simulated sRNA-seq experiments. The results demonstrate that BrumiR reaches the highest recall for miRNA discovery, while at the same time being much faster and more efficient than the state-of-the-art tools evaluated. The latter allows BrumiR to analyze a large number of sRNA-seq experiments, from plants or animal species. Moreover, BrumiR detects additional information regarding other expressed sequences (sRNAs, isomiRs, etc.), thus maximizing the biological insight gained from sRNA-seq experiments. Additionally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2reference) that performs an a posteriori exhaustive search to identify the precursor sequences. Finally, we also provide a machine learning classifier based on a random forest model that evaluates the sequence-derived features to further refine the prediction obtained from the BrumiR-core. The code of BrumiR and all the algorithms that compose the BrumiR toolkit are freely available at https://github.com/camoragaq/BrumiR.
Affiliation(s)
- Evelyn Sanchez
  - Centro de Genómica y Bioinformática, Facultad de Ciencias, Ingenieria y Tecnologia, Universidad Mayor, 8580745 Santiago, Chile
  - Agencia Nacional de Investigación y Desarrollo–Millennium Science Initiative Program, Millennium Institute for Integrative Biology iBio, 7500565 Santiago, Chile
- Mariana Galvão Ferrarini
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622 Villeurbanne, France
  - Inria Lyon Centre, ERABLE team, 56 Bd Niels Bohr, 69100 Villeurbanne, France
  - Université de Lyon, INSA-Lyon, INRA, BF2i, UMR0203, Villeurbanne F-69621, France
- Rodrigo A Gutierrez
  - Agencia Nacional de Investigación y Desarrollo–Millennium Science Initiative Program, Millennium Institute for Integrative Biology iBio, 7500565 Santiago, Chile
  - Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, 8331010 Santiago, Chile
  - Fondo de Desarrollo de Areas Prioritarias, Center for Genome Regulation, Instituto de Ecología y Biodiversidad, 8370415 Santiago, Chile
- Elena A Vidal
  - Centro de Genómica y Bioinformática, Facultad de Ciencias, Ingenieria y Tecnologia, Universidad Mayor, 8580745 Santiago, Chile
  - Agencia Nacional de Investigación y Desarrollo–Millennium Science Initiative Program, Millennium Institute for Integrative Biology iBio, 7500565 Santiago, Chile
  - Escuela de Biotecnología, Facultad de Ciencias, Ingenieria y Tecnologia, Universidad Mayor, 8580745 Santiago, Chile
3
Carvalho-Costa TM, Tiveron RDR, Mendes MT, Barbosa CG, Nevoa JC, Roza GA, Silva MV, Figueiredo HCP, Rodrigues V, Soares SDC, Oliveira CJF. Salivary and Intestinal Transcriptomes Reveal Differential Gene Expression in Starving, Fed and Trypanosoma cruzi-Infected Rhodnius neglectus. Front Cell Infect Microbiol 2022; 11:773357. [PMID: 34988032 PMCID: PMC8722679 DOI: 10.3389/fcimb.2021.773357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Accepted: 11/04/2021] [Indexed: 11/28/2022] Open
Abstract
Rhodnius neglectus is a potential vector of Trypanosoma cruzi (Tc), the causative agent of Chagas disease. The salivary glands (SGs) and intestine (INT) are actively required during blood feeding. The saliva from SGs is injected into the vertebrate host, modulating immune responses and favoring feeding for INT digestion. Tc infection significantly alters the physiology of these tissues; however, studies that assess this are still scarce. This study aimed to gain a better understanding of the global transcriptional expression of genes in SGs and INT during fasting (FA), fed (FE), and fed in the presence of Tc (FE + Tc) conditions. In FA, the expression of transcripts related to homeostasis maintenance proteins during periods of stress was predominant. Therefore, the transcript levels of Tret1-like and Hsp70Ba proteins were increased. Blood appeared to be responsible for alterations found in the FE group, as most of the expressed transcripts, such as proteases and cathepsin D, were related to digestion. In FE + Tc group, there was a decreased expression of blood processing genes for insect metabolism (e.g., Antigen-5 precursor, Pr13a, and Obp), detoxification (Sult1) in INT and acid phosphatases in SG. We also found decreased transcriptional expression of lipocalins and nitrophorins in SG and two new proteins, pacifastin and diptericin, in INT. Several transcripts of unknown proteins with investigative potential were found in both tissues. Our results also show that the presence of Tc can change the expression in both tissues for a long or short period of time. While SG homeostasis seems to be re-established on day 9, changes in INT are still evident. The findings of this study may be used for future research on parasite-vector interactions and contribute to the understanding of food physiology and post-meal/infection in triatomines.
Affiliation(s)
- Tamires Marielem Carvalho-Costa
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Rafael Destro Rosa Tiveron
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Maria Tays Mendes
  - Biomedical Research Center, The University of Texas at El Paso, El Paso, TX, United States
- Cecília Gomes Barbosa
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Jessica Coraiola Nevoa
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Guilherme Augusto Roza
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Marcos Vinícius Silva
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Virmondes Rodrigues
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Siomar de Castro Soares
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
- Carlo José Freire Oliveira
  - Laboratory of Immunology and Bioinformatics, Institute of Biological and Natural Sciences, Federal University of Triangulo Mineiro, Uberaba, Brazil
4
Voshall A, Behera S, Li X, Yu XH, Kapil K, Deogun JS, Shanklin J, Cahoon EB, Moriyama EN. A consensus-based ensemble approach to improve transcriptome assembly. BMC Bioinformatics 2021; 22:513. [PMID: 34674629 PMCID: PMC8532302 DOI: 10.1186/s12859-021-04434-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/28/2020] [Accepted: 10/10/2021] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models and there is no easy way to determine which are more accurate. Furthermore, having alternative-splicing events exacerbates such difficult assembly problems. While benchmarking transcriptome assemblies is critical, this is also not trivial due to the general lack of true reference transcriptomes. RESULTS In this study, we first provide a pipeline to generate a set of the simulated benchmark transcriptome and corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches including both de novo and genome-guided methods. The results showed that the assembly performance deteriorates significantly when alternative transcripts (isoforms) exist or for genome-guided methods when the reference is not available from the same genome. To improve the transcriptome assembly performance, leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble. CONCLUSIONS Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as any de novo assemblers we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. 
Furthermore, ConSemble using de novo assemblers matched or exceeded the best performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrated that the ConSemble consensus strategy both for de novo and genome-guided assemblers can improve transcriptome assembly. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from: http://bioinfolab.unl.edu/emlab/consemble/ .
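The consensus idea in this abstract, keeping only contigs that several independent assemblies agree on, can be illustrated with a minimal sketch. This is not the ConSemble script itself: the function name `consensus_contigs`, the `min_support` threshold, and the use of exact sequence identity to decide that two contigs "overlap" are all assumptions made for illustration.

```python
from collections import Counter


def consensus_contigs(assemblies, min_support=3):
    """Return the sequences reported by at least `min_support` assemblies.

    `assemblies` is a list of contig lists, one list per assembler.
    Exact string identity stands in for "the same contig" (an assumption).
    """
    support = Counter()
    for contigs in assemblies:
        # Count each sequence once per assembly, however often it repeats.
        support.update(set(contigs))
    return sorted(seq for seq, n in support.items() if n >= min_support)


# Toy output of four hypothetical assemblers.
assemblies = [
    ["ATGGCC", "ATGTTT", "ATGAAA"],
    ["ATGGCC", "ATGTTT"],
    ["ATGGCC", "ATGCCC"],
    ["ATGGCC", "ATGTTT", "ATGCCC"],
]
print(consensus_contigs(assemblies))  # ['ATGGCC', 'ATGTTT']
```

Raising `min_support` trades recall for precision: a contig produced by a single assembler is dropped, which is how a consensus filter can remove incorrectly assembled contigs while keeping most correct ones.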
Affiliation(s)
- Adam Voshall
  - School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
  - Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
  - Department of Pediatrics, Division of Genetics and Genomics, Boston Children's Hospital/Harvard Medical School, Boston, MA, 02115, USA
- Sairam Behera
  - Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Xiangjun Li
  - Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
  - Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
- Xiao-Hong Yu
  - Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY, 11794, USA
- Kushagra Kapil
  - Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
- Jitender S Deogun
  - Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
- John Shanklin
  - Biology Department, Brookhaven National Laboratory, Upton, NY, 11973, USA
- Edgar B Cahoon
  - Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
  - Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
- Etsuko N Moriyama
  - School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
  - Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
5
Behera S, Voshall A, Moriyama EN. Plant Transcriptome Assembly: Review and Benchmarking. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
6
Galise TR, Esposito S, D'Agostino N. Guidelines for Setting Up a mRNA Sequencing Experiment and Best Practices for Bioinformatic Data Analysis. Methods Mol Biol 2021; 2264:137-162. [PMID: 33263908 DOI: 10.1007/978-1-0716-1201-9_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
RNA-sequencing, commonly referred to as RNA-seq, is the most recently developed method for the analysis of transcriptomes. It uses high-throughput next-generation sequencing technologies and has revolutionized our understanding of the complexity and dynamics of whole transcriptomes. In this chapter, we recall the key developments in transcriptome analysis and dissect the steps of the general workflow that users can follow to design and perform an mRNA-seq experiment and to process mRNA-seq data obtained with Illumina technology. The chapter proposes guidelines for completing an mRNA-seq study properly and offers recommendations for best practices based on recent literature and on the latest developments in technology and algorithms. We also note the large number of choices available (especially for bioinformatic data analysis), which can overwhelm the scientist. In the last part of the chapter, we discuss the new frontiers of single-cell RNA-seq and isoform sequencing by long-read technology.
Affiliation(s)
- Teresa Rosa Galise
  - Department of Agricultural Sciences, University of Naples Federico II, Portici, Italy
- Salvatore Esposito
  - CREA Research Centre for Vegetable and Ornamental Crops, Pontecagnano Faiano, Italy
- Nunzio D'Agostino
  - Department of Agricultural Sciences, University of Naples Federico II, Portici, Italy
7
Mora-Márquez F, Vázquez-Poletti JL, Chano V, Collada C, Soto Á, de Heredia UL. Hardware Performance Evaluation of De novo Transcriptome Assembly Software in Amazon Elastic Compute Cloud. Curr Bioinform 2020. [DOI: 10.2174/1574893615666191219095817] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/22/2022]
Abstract
Background: Bioinformatics software for RNA-seq analysis has high computational requirements in terms of the number of CPUs, RAM size, and processor characteristics. Specifically, de novo transcriptome assembly demands large computational infrastructure due to the massive data size and the complexity of the algorithms employed. Comparative studies on the quality of the transcriptomes yielded by de novo assemblers have been published previously; they lack, however, a hardware efficiency-oriented approach to help select the assembly hardware platform in a cost-efficient way.
Objective: We tested the performance of two popular de novo transcriptome assemblers, Trinity and SOAPdenovo-Trans (SDNT), in terms of cost-efficiency and quality to assess limitations, and provided troubleshooting and guidelines to run transcriptome assemblies efficiently.
Methods: We built virtual machines with different hardware characteristics (CPU number, RAM size) in the Amazon Elastic Compute Cloud of the Amazon Web Services. Using simulated and real data sets, we measured the elapsed time, cost, CPU percentage and output size of small and large data set assemblies.
Results: For small data sets, SDNT outperformed Trinity by an order of magnitude, significantly reducing the duration and cost of the assembly. For large data sets, Trinity performed better than SDNT. Both assemblers provide good-quality transcriptomes.
Conclusion: The selection of the optimal transcriptome assembler and the provision of computational resources depend on the combined effect of the size and complexity of RNA-seq experiments.
Affiliation(s)
- Fernando Mora-Márquez
  - GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
- José Luis Vázquez-Poletti
  - GI Arquitectura de Sistemas Distribuidos, Dpto. Arquitectura de Computadores y Automatica, Facultad de Informatica, Universidad Complutense de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
- Víctor Chano
  - GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
- Carmen Collada
  - GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
- Álvaro Soto
  - GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
- Unai López de Heredia
  - GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
8
Hernandez-Escribano L, Visser EA, Iturritxa E, Raposo R, Naidoo S. The transcriptome of Pinus pinaster under Fusarium circinatum challenge. BMC Genomics 2020; 21:28. [PMID: 31914917 PMCID: PMC6950806 DOI: 10.1186/s12864-019-6444-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 08/26/2019] [Accepted: 12/30/2019] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Fusarium circinatum, the causal agent of pitch canker disease, poses a serious threat to several Pinus species, affecting plantations and nurseries. Although Pinus pinaster has shown moderate resistance to F. circinatum, the molecular mechanisms of defense in this host are still unknown. Phytohormones produced by the plant and by the pathogen are known to play a crucial role in determining the outcome of plant-pathogen interactions. Therefore, the aim of this study was to determine the role of phytohormones in the F. circinatum virulence that compromises host resistance. RESULTS A high-quality P. pinaster de novo transcriptome assembly was generated, represented by 24,375 sequences of which 17,593 were full-length genes, and was used to determine the expression profiles of both organisms during the infection process at 3, 5 and 10 days post-inoculation using a dual RNA-sequencing approach. The moderate resistance shown by Pinus pinaster at the early time points may be explained by the expression profiles pertaining to early recognition of the pathogen, the induction of pathogenesis-related proteins and the activation of complex phytohormone signaling pathways that involve crosstalk between salicylic acid, jasmonic acid, ethylene and possibly auxins. Moreover, the expression of F. circinatum genes related to hormone biosynthesis suggests manipulation of the host phytohormone balance to its own benefit. CONCLUSIONS We hypothesize three key steps of host manipulation: perturbing ethylene homeostasis by fungal expression of genes related to ethylene biosynthesis, blocking jasmonic acid signaling by coronatine insensitive 1 (COI1) suppression, and preventing salicylic acid biosynthesis from the chorismate pathway by the synthesis of isochorismatase family hydrolase (ICSH) genes. These results warrant further testing in F. circinatum mutants to confirm the mechanism behind perturbing host phytohormone homeostasis.
Affiliation(s)
- Laura Hernandez-Escribano
  - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria, Centro de Investigación Forestal (INIA-CIFOR), Madrid, Spain
  - Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Madrid, Spain
- Erik A Visser
  - Department of Biochemistry, Genetics and Microbiology, Forestry and Agricultural Biotechnology Institute (FABI), Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa
- Eugenia Iturritxa
  - NEIKER, Granja Modelo de Arkaute, Apdo 46, 01080, Vitoria-Gasteiz, Spain
- Rosa Raposo
  - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria, Centro de Investigación Forestal (INIA-CIFOR), Madrid, Spain
  - Instituto de Gestión Forestal Sostenible (iuFOR), Universidad de Valladolid/INIA, Valladolid, Spain
- Sanushka Naidoo
  - Department of Biochemistry, Genetics and Microbiology, Forestry and Agricultural Biotechnology Institute (FABI), Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa
9
Carvajal-Lopez P, Von Borstel FD, Torres A, Rustici G, Gutierrez J, Romero-Vivas E. Microarray-Based Quality Assessment as a Supporting Criterion for de novo Transcriptome Assembly Selection. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:198-206. [PMID: 30059314 DOI: 10.1109/tcbb.2018.2860997] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
RNA-Sequencing and de novo assembly have enabled the analysis of species without available reference transcriptomes, although intrinsic features (biological and technical) induce errors in the reconstruction. One strategy to resolve these errors is to vary the assembly parameters to generate multiple reconstructions. However, selecting the best assembly remains a challenge. Quantitative metrics for quality assessment have been inconsistent when compared with pertinent references. In this paper, a criterion for supporting assembly selection based on mapping DNA microarray hybridized probes to assembly sets is proposed. Mouse and fruit fly RNA-Seq datasets were assembled with standard de novo procedures. Quality was assessed using quantitative metrics and the proposed criterion. The assembly that best mapped to the available reference transcriptomes of these model species provided the highest quality assembly. The hybridized probes identified the best assemblies, whereas quantitative metrics remained inconsistent. For example, a subtle but statistically significant probe-mapping difference of 0.25 percent (ANOVA, p < 0.05) enabled the selection of an assembly that identified 3,719 more contigs and mapped 1,049 more contigs to the mouse reference transcriptome. The availability of microarray data for non-model species makes the proposed criterion suitable for quality assessment of multiple de novo assembly strategies.
10
Malik L, Almodaresi F, Patro R. Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics 2019; 34:3265-3272. [PMID: 29746620 DOI: 10.1093/bioinformatics/bty378] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/20/2017] [Accepted: 05/03/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation De novo transcriptome analysis using RNA-seq offers a promising means to study gene expression in non-model organisms. Yet, the difficulty of transcriptome assembly means that the contigs provided by the assembler often represent a fractured and incomplete view of the transcriptome, complicating downstream analysis. We introduce Grouper, a new method for clustering contigs from de novo assemblies that are likely to belong to the same transcripts and genes; these groups can subsequently be analyzed more robustly. When provided with access to the genome of a related organism, Grouper can transfer annotations to the de novo assembly, further improving the clustering. Results On de novo assemblies from four different species, we show that Grouper is able to accurately cluster a larger number of contigs than the existing state-of-the-art method. The Grouper pipeline is able to map greater than 10% more reads against the contigs, leading to accurate downstream differential expression analyses. The labeling module, in the presence of a closely related annotated genome, can efficiently transfer annotations to the contigs and use this information to further improve clustering. Overall, Grouper provides a complete and efficient pipeline for processing de novo transcriptomic assemblies. Availability and implementation The Grouper software is freely available at https://github.com/COMBINE-lab/grouper under the 2-clause BSD license. Supplementary information Supplementary data are available at Bioinformatics online.
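To make the graph-based clustering idea concrete, here is a minimal sketch. It assumes, for illustration only, that two contigs are linked when enough reads map to both of them, and it uses plain connected components (via union-find) in place of Grouper's actual clustering procedure. The function name `cluster_contigs`, the `min_shared` threshold, and the input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cluster_contigs(read_hits, min_shared=2):
    """Group contigs connected by at least `min_shared` shared reads.

    `read_hits` holds, for every read, the list of contigs it maps to.
    """
    shared = Counter()
    contigs = set()
    for hits in read_hits:
        contigs.update(hits)
        # Every pair of contigs hit by the same read gains one shared read.
        for a, b in combinations(sorted(set(hits)), 2):
            shared[(a, b)] += 1

    parent = {c: c for c in contigs}

    def find(c):
        # Walk to the representative, halving the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), n in shared.items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Toy mapping: c1, c2 and c3 share reads; c4 is linked to c1 by one read only.
read_hits = [
    ["c1", "c2"], ["c1", "c2"], ["c2", "c3"], ["c3"],
    ["c4"], ["c2", "c3"], ["c4", "c1"],
]
print(cluster_contigs(read_hits))  # [['c1', 'c2', 'c3'], ['c4']]
```

Counting reads per group rather than per contig is what makes downstream expression estimates less sensitive to a fragmented assembly.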
Affiliation(s)
- Laraib Malik
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Fatemeh Almodaresi
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Rob Patro
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
11
Durai DA, Schulz MH. In silico read normalization using set multi-cover optimization. Bioinformatics 2019; 34:3273-3280. [PMID: 29912280 PMCID: PMC6157080 DOI: 10.1093/bioinformatics/bty307] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/19/2017] [Accepted: 04/18/2018] [Indexed: 11/24/2022] Open
Abstract
Motivation De Bruijn graphs are a common assembly data structure for sequencing datasets. But with the advances in sequencing technologies, assembling high coverage datasets has become a computational challenge. Read normalization, which removes redundancy in datasets, is widely applied to reduce resource requirements. Current normalization algorithms, though efficient, provide no guarantee to preserve important k-mers that form connections between regions in the graph. Results Here, normalization is phrased as a set multi-cover problem on reads and a heuristic algorithm, Optimized Read Normalization Algorithm (ORNA), is proposed. ORNA normalizes to the minimum number of reads required to retain all k-mers and their relative k-mer abundances from the original dataset. Hence, all connections from the original graph are preserved. ORNA was tested on various RNA-seq datasets with different coverage values. It was compared to the current normalization algorithms and was found to perform better. Normalizing error-corrected data allows for more accurate assemblies than normalizing the uncorrected dataset. Further, an application is proposed in which multiple datasets are combined and normalized to predict novel transcripts that would have been missed otherwise. Finally, ORNA is a general-purpose normalization algorithm that is fast and significantly reduces datasets, with a loss of assembly quality between 1% and 30% depending on reduction stringency. Availability and implementation ORNA is available at https://github.com/SchulzLab/ORNA. Supplementary information Supplementary data are available at Bioinformatics online.
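The set multi-cover framing can be sketched in a few lines: every k-mer must stay covered a required number of times, and a read is kept only if it helps meet a requirement that is still open. The sketch below is a simple greedy pass, not ORNA's implementation. The function names, the fixed cap `max_cover`, and the rule `min(abundance, max_cover)` for the required coverage are assumptions chosen for illustration.

```python
from collections import Counter


def kmers(read, k):
    """Return all k-mers of `read`, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k=3, max_cover=2):
    """Greedily keep reads until every k-mer meets its coverage requirement."""
    abundance = Counter(km for r in reads for km in kmers(r, k))
    # Assumed requirement: cover each k-mer up to `max_cover` times,
    # but never more often than it occurs in the data.
    required = {km: min(n, max_cover) for km, n in abundance.items()}
    covered = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        # Keep the read only if some k-mer in it is still under-covered.
        if any(covered[km] < required[km] for km in kms):
            kept.append(read)
            covered.update(kms)
    return kept


reads = ["ACGTA", "ACGTA", "ACGTA", "CGTAC", "ACGTA"]
print(normalize_reads(reads))  # ['ACGTA', 'ACGTA', 'CGTAC']
```

Because no k-mer ever has a requirement of zero, every k-mer of the input survives in the kept reads, so every edge of the original de Bruijn graph is preserved. This is the property the abstract contrasts with earlier normalization heuristics.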
Affiliation(s)
- Dilip A Durai
  - Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbrücken, Germany
  - Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
  - Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
- Marcel H Schulz
  - Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbrücken, Germany
  - Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
12
|
Ciezarek AG, Osborne OG, Shipley ON, Brooks EJ, Tracey SR, McAllister JD, Gardner LD, Sternberg MJE, Block B, Savolainen V. Phylotranscriptomic Insights into the Diversification of Endothermic Thunnus Tunas. Mol Biol Evol 2019; 36:84-96. [PMID: 30364966 PMCID: PMC6340463 DOI: 10.1093/molbev/msy198] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 12/17/2022] Open
Abstract
Birds, mammals, and certain fishes, including tunas, opahs and lamnid sharks, are endothermic, conserving internally generated, metabolic heat to maintain body or tissue temperatures above that of the environment. Bluefin tunas are commercially important fishes worldwide, and some populations are threatened. They are renowned for their endothermy, maintaining elevated temperatures of the oxidative locomotor muscle, viscera, brain and eyes, and occupying cold, productive high-latitude waters. Less cold-tolerant tunas, such as yellowfin tuna, by contrast, remain in warm-temperate to tropical waters year-round, reproducing more rapidly than most temperate bluefin tuna populations, providing resiliency in the face of large-scale industrial fisheries. Despite the importance of these traits to not only fisheries but also habitat utilization and responses to climate change, little is known of the genetic processes underlying the diversification of tunas. In collecting and analyzing sequence data across 29,556 genes, we found that parallel selection on standing genetic variation is associated with the evolution of endothermy in bluefin tunas. This includes two shared substitutions in genes encoding glycerol-3 phosphate dehydrogenase, an enzyme that contributes to thermogenesis in bumblebees and mammals, as well as four genes involved in the Krebs cycle, oxidative phosphorylation, β-oxidation, and superoxide removal. Using phylogenetic techniques, we further illustrate that the eight Thunnus species are genetically distinct, but found evidence of mitochondrial genome introgression across two species. Phylogeny-based metrics highlight conservation needs for some of these species.
Affiliation(s)
- Adam G Ciezarek
- Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot, United Kingdom
- Owen G Osborne
- Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot, United Kingdom
- Oliver N Shipley
- Shark Research and Conservation Program, The Cape Eleuthera Institute, Rock Sound, Eleuthera, The Bahamas
- School of Marine and Atmospheric Science, Stony Brook University, Stony Brook, NY
- Edward J Brooks
- Shark Research and Conservation Program, The Cape Eleuthera Institute, Rock Sound, Eleuthera, The Bahamas
- Sean R Tracey
- Institute for Marine and Antarctic Studies, University of Tasmania, Hobart, TAS, Australia
- Jaime D McAllister
- Institute for Marine and Antarctic Studies, University of Tasmania, Hobart, TAS, Australia
- Luke D Gardner
- Department of Biology, Hopkins Marine Station, Stanford University, Pacific Grove, CA
- Michael J E Sternberg
- Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, Kensington, London, United Kingdom
- Barbara Block
- Department of Biology, Hopkins Marine Station, Stanford University, Pacific Grove, CA
- Vincent Savolainen
- Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot, United Kingdom
- Corresponding author: E-mail:
|
13
|
Reddy RRS, Ramanujam MV. High Throughput Sequencing-Based Approaches for Gene Expression Analysis. Methods Mol Biol 2019; 1783:299-323. [PMID: 29767369 DOI: 10.1007/978-1-4939-7834-2_15] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 02/11/2023]
Abstract
Next-generation sequencing has emerged as the method of choice to answer fundamental questions in biology. Massively parallel sequencing technology for RNA-Seq analysis enables a better understanding of gene expression patterns in model and nonmodel organisms. Sequencing per se has become a commodity, while analyzing and interpreting huge amounts of data remains a significant challenge. This chapter discusses the complexities involved in sequencing and analysis and tries to simplify sequencing-based gene expression analysis. The methods and analysis workflow are presented with biologists and experimental scientists in mind.
Affiliation(s)
- M V Ramanujam
- Clevergene Biocorp Private Limited, Bangalore, Karnataka, India.
|
14
|
Cipcigan F, Carrieri AP, Pyzer-Knapp EO, Krishna R, Hsiao YW, Winn M, Ryadnov MG, Edge C, Martyna G, Crain J. Accelerating molecular discovery through data and physical sciences: Applications to peptide-membrane interactions. J Chem Phys 2018; 148:241744. [DOI: 10.1063/1.5027261] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/30/2023] Open
Affiliation(s)
- Flaviu Cipcigan
- IBM Research UK, Hartree Centre, Daresbury WA4 4AD, United Kingdom
- Ritesh Krishna
- IBM Research UK, Hartree Centre, Daresbury WA4 4AD, United Kingdom
- Ya-Wen Hsiao
- STFC Daresbury Laboratories, Daresbury WA4 4AD, United Kingdom
- Martyn Winn
- STFC Daresbury Laboratories, Daresbury WA4 4AD, United Kingdom
- Maxim G. Ryadnov
- National Physical Laboratory, Hampton Road, Teddington, United Kingdom
- Colin Edge
- GSK Medicines Research Centre, Stevenage SG1 2NY, United Kingdom
- Glenn Martyna
- IBM T. J. Watson Research Center, Yorktown Heights, New York 10598, USA
- Jason Crain
- IBM Research UK, Hartree Centre, Daresbury WA4 4AD, United Kingdom
- Maxwell Centre, University of Cambridge, Cambridge CB3 0HE, United Kingdom
|
15
|
Abstract
Most reconstruction methods for genomes of ancient origin that are used today require a closely related reference. In order to identify genomic rearrangements or the deletion of whole genes, de novo assembly has to be used. However, because of inherent problems with ancient DNA, its de novo assembly is highly complicated. In order to tackle the diversity in the length of the input reads, we propose a two-layer approach, where multiple assemblies are generated in the first layer, which are then combined in the second layer. We used this two-layer assembly to generate assemblies for two different ancient samples and compared the results to current de novo assembly approaches. We are able to improve the assembly with respect to the length of the contigs and can resolve more repetitive regions.
Affiliation(s)
- Alexander Seitz
- Center for Bioinformatics (ZBIT), Integrative Transcriptomics, Eberhard-Karls-Universität Tübingen, Tübingen, Germany
- Kay Nieselt
- Center for Bioinformatics (ZBIT), Integrative Transcriptomics, Eberhard-Karls-Universität Tübingen, Tübingen, Germany
|
16
|
An extracellular yellow laccase with potent dye decolorizing ability from the fungus Leucoagaricus naucinus LAC-04. Int J Biol Macromol 2016; 93:837-842. [DOI: 10.1016/j.ijbiomac.2016.09.046] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 01/27/2016] [Revised: 09/11/2016] [Accepted: 09/15/2016] [Indexed: 01/09/2023]
|