1. Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
2. Brooks TG, Lahens NF, Mrčela A, Sarantopoulou D, Nayak S, Naik A, Sengupta S, Choi PS, Grant GR. BEERS2: RNA-Seq simulation through high fidelity in silico modeling. Brief Bioinform 2024; 25:bbae164. [PMID: 38605641 PMCID: PMC11009461 DOI: 10.1093/bib/bbae164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 01/26/2024] [Accepted: 03/26/2024] [Indexed: 04/13/2024] Open
Abstract
Simulation of RNA-seq reads is critical in the assessment, comparison, benchmarking and development of bioinformatics tools. Yet the field of RNA-seq simulators has progressed little in the last decade. To address this need we have developed BEERS2, which combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline. BEERS2 takes input transcripts (typically full-length messenger RNA transcripts with polyA tails) from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts in FASTQ, SAM or BAM format, with the SAM and BAM output containing the true alignment to the reference genome. It also produces true transcript-level quantification values. The simulation is designed to include the effects of polyA selection and RiboZero for ribosomal depletion, hexamer priming sequence biases, GC-content biases in polymerase chain reaction (PCR) amplification, barcode read errors and errors during PCR amplification. These characteristics combine to make BEERS2 the most complete simulation of RNA-seq to date. Finally, we demonstrate the use of BEERS2 by measuring the effect of several settings on the popular Salmon pseudoalignment algorithm.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Dimitra Sarantopoulou
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Current address: National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Soumyashant Nayak
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Current address: Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
- Amruta Naik
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Shaon Sengupta
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Children’s Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Peter S Choi
  - Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
3. Ntasis VF, Guigó R. Studying relative RNA localization From nucleus to the cytosol. bioRxiv 2024:2024.03.06.583744. [PMID: 38559161 PMCID: PMC10979850 DOI: 10.1101/2024.03.06.583744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
The precise coordination of important biological processes, such as differentiation and development, is highly dependent on the regulation of the expression of genetic information. The flow of genetic information is tightly regulated on multiple levels. Among them, RNA export to the cytosol is an essential step for the production of proteins in eukaryotic cells. Hence, estimating the relative concentration of RNA molecules of a given transcript species in the nucleus and in the cytosol is of major significance, as it contributes to the understanding of the dynamics of RNA trafficking between the nucleus and the cytosol. The most efficient way to estimate the levels of RNA species genome-wide is through RNA sequencing (RNA-seq). RNA-seq can be performed separately in the nucleus and in the cytosol. However, measured transcript levels are relative to the total volume of RNA in each compartment, and this volume is usually unknown, so the transcript levels in the nucleus and in the cytosol cannot be directly compared. Here we show theoretically that if, in addition to nuclear and cytosolic RNA-seq, whole cell RNA-seq is also performed, then accurate estimations of the localization of transcripts can be obtained. Based on this, we designed a method that first estimates the fraction of the total RNA volume that is in the cytosol (or, equivalently, in the nucleus) and then estimates this fraction for every transcript. We evaluate our methodology on simulated data and on available nuclear and cytosolic single-cell data. Finally, we use our method to investigate the cellular localization of transcripts using bulk RNA-seq data from the ENCODE project.
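One way to read the idea in this abstract is as a two-component mixture: if each profile is normalized to relative abundances, the whole-cell level of transcript i should be roughly `beta * cyt[i] + (1 - beta) * nuc[i]`, where `beta` is the cytosolic share of total RNA. The sketch below fits `beta` by least squares and then derives a per-transcript cytosolic fraction. The mixture model, the least-squares fit and all function names are assumptions made for illustration; the abstract does not state the estimator the authors use.

```python
def cytosolic_rna_fraction(nuc, cyt, whole):
    """Least-squares beta for the assumed model whole[i] ~ beta*cyt[i] + (1-beta)*nuc[i].

    nuc, cyt, whole: relative abundances of the same transcripts, in the same order.
    """
    # Rearranged model: (whole - nuc) = beta * (cyt - nuc), a one-parameter regression.
    num = sum((w - n) * (c - n) for n, c, w in zip(nuc, cyt, whole))
    den = sum((c - n) ** 2 for n, c in zip(nuc, cyt))
    if den == 0:
        raise ValueError("nuclear and cytosolic profiles are identical; beta is not identifiable")
    return num / den


def transcript_cytosolic_fractions(nuc, cyt, whole):
    """Per-transcript share of molecules located in the cytosol under the same model."""
    beta = cytosolic_rna_fraction(nuc, cyt, whole)
    fractions = []
    for n, c in zip(nuc, cyt):
        total = beta * c + (1 - beta) * n
        # Transcripts absent from both compartments have no defined fraction.
        fractions.append(beta * c / total if total > 0 else None)
    return fractions


# Toy profiles built with beta = 0.75 (hypothetical numbers, not data from the paper).
nuc = [0.5, 0.3, 0.2]
cyt = [0.2, 0.3, 0.5]
whole = [0.75 * c + 0.25 * n for n, c in zip(nuc, cyt)]
beta = cytosolic_rna_fraction(nuc, cyt, whole)
fractions = transcript_cytosolic_fractions(nuc, cyt, whole)
```

On noisy data the fitted `beta` can fall outside the interval from 0 to 1, so a practical version would need to constrain it; that refinement is left out here.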
Affiliation(s)
- Vasilis F. Ntasis
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Catalonia, Spain
- Roderic Guigó
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Catalonia, Spain
  - Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
4. Lio CT, Düz T, Hoffmann M, Willruth LL, Baumbach J, List M, Tsoy O. Comprehensive benchmark of differential transcript usage analysis for static and dynamic conditions. bioRxiv 2024:2024.01.14.575548. [PMID: 38313260 PMCID: PMC10836064 DOI: 10.1101/2024.01.14.575548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/06/2024]
Abstract
RNA sequencing offers unique insights into transcriptome diversity, and a plethora of tools have been developed to analyze alternative splicing. One important task is to detect changes in the relative transcript abundance in differential transcript usage (DTU) analysis. The choice of the right analysis tool is non-trivial and depends on experimental factors such as the availability of single- or paired-end and bulk or single-cell data. To help users select the most promising tool for their task, we performed a comprehensive benchmark of DTU detection tools. We cover a wide array of experimental settings, using simulated bulk and single-cell RNA-seq data as well as real transcriptomics datasets, including time-series data. Our results suggest that DEXSeq, edgeR, and LimmaDS are better choices for paired-end data, while DSGseq and DEXSeq can be used for single-end data. In single-cell simulation settings, we showed that satuRn performs better than DTUrtle. In addition, we showed that Spycone is optimal for time series DTU/IS analysis based on the evidence provided using GO terms enrichment analysis.
Affiliation(s)
- Chit Tong Lio
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Tolga Düz
  - Chair of Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
- Markus Hoffmann
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
  - Institute for Advanced Study, Technical University of Munich, Garching D-85748, Germany
  - National Institute of Diabetes, Digestive, and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Lina-Liv Willruth
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Jan Baumbach
  - Chair of Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Campusvej 55, 5000 Odense, Denmark
- Markus List
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Olga Tsoy
  - Chair of Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
5. Lee J, Kim M, Han K, Yoon S. StringFix: an annotation-guided transcriptome assembler improves the recovery of amino acid sequences from RNA-Seq reads. Genes Genomics 2023; 45:1599-1609. [PMID: 37837515 DOI: 10.1007/s13258-023-01458-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Accepted: 10/01/2023] [Indexed: 10/16/2023]
Abstract
BACKGROUND Reconstruction of amino acid sequences from an assembled transcriptome is of interest in personalized medicine, for example, to predict drug-target (or protein-protein) interactions while considering an individual's genomic variations. Most of the existing transcriptome assemblers, however, seem not to be well suited for this purpose. METHODS In this work, we present StringFix, an annotation-guided transcriptome assembly and protein sequence reconstruction software tool that takes genome-aligned reads and the annotations associated with the reference genome as input. The tool 'fixes' the pre-annotated transcript sequence by taking small variations into account, and finally produces possible amino acid sequences that are likely to exist in the test tissue. RESULTS The results show that, using outputs from existing reference-based assemblers as the input GTF guide, StringFix could reconstruct amino acid sequences more precisely and with higher sensitivity than direct generation using the recovered transcripts from all the assemblers we tested. CONCLUSION By using StringFix with the existing reference-based assemblers, one can recover not only novel transcripts and isoforms but also the possible amino acid sequences stemming from them.
Affiliation(s)
- Joongho Lee
  - Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Minsoo Kim
  - Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Kyudong Han
  - Center for Bio-Medical Engineering Core Facility, Dankook Univ, Cheonan, 31116, Korea
  - Dept. of Microbiology, College of Science & Technology, Dankook Univ, Cheonan, 31116, Korea
  - HuNbiome Co., Ltd, R&D Center, Seoul, 08503, Korea
- Seokhyun Yoon
  - Dept. of Electronics and Electrical Engineering, College of Engineering, Dankook Univ, Yongin-si, 16890, Korea
6. Bar A, Argaman L, Eldar M, Margalit H. TRS: a method for determining transcript termini from RNAtag-seq sequencing data. Nat Commun 2023; 14:7843. [PMID: 38030608 PMCID: PMC10687069 DOI: 10.1038/s41467-023-43534-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2022] [Accepted: 11/12/2023] [Indexed: 12/01/2023] Open
Abstract
In bacteria, determination of the 3' termini of transcripts plays an essential role in regulation of gene expression, affecting the functionality and stability of the transcript. Several experimental approaches have been developed to identify the 3' termini of transcripts; however, these have been applied only to a limited number of bacteria and growth conditions. Here we present a straightforward approach to identify 3' termini from widely available RNA-seq data without the need for additional experiments. Our approach relies on the observation that the RNAtag-seq sequencing protocol results in an overabundance of reads mapped to transcript 3' termini. We present TRS (Termini by Read Starts), a computational pipeline exploiting this property to identify 3' termini in RNAtag-seq data, and show that the identified 3' termini are highly reliable. Since RNAtag-seq data are widely available for many bacteria and growth conditions, our approach paves the way for studying bacterial transcription termination in an unprecedented scope.
Affiliation(s)
- Amir Bar
  - Department of Microbiology and Molecular Genetics IMRIC, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
- Liron Argaman
  - Department of Microbiology and Molecular Genetics IMRIC, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
- Michal Eldar
  - Department of Microbiology and Molecular Genetics IMRIC, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
- Hanah Margalit
  - Department of Microbiology and Molecular Genetics IMRIC, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
7. Dias FHC, Cáceres M, Williams L, Mumey B, Tomescu AI. A safety framework for flow decomposition problems via integer linear programming. Bioinformatics 2023; 39:btad640. [PMID: 37862229 PMCID: PMC10628435 DOI: 10.1093/bioinformatics/btad640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 09/05/2023] [Accepted: 10/19/2023] [Indexed: 10/22/2023] Open
Abstract
MOTIVATION Many important problems in Bioinformatics (e.g. assembly or multiassembly) admit multiple solutions, while the final objective is to report only one. A common approach to deal with this uncertainty is finding "safe" partial solutions (e.g. contigs) which are common to all solutions. Previous research on safety has focused on polynomial-time solvable problems, whereas many successful and natural models are NP-hard to solve, leaving a lack of "safety tools" for such problems. We propose the first method for computing all safe solutions for an NP-hard problem, "minimum flow decomposition" (MFD). We obtain our results by developing a "safety test" for paths based on a general integer linear programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed at reducing the total ILP time, the most efficient of these being based on a recursive group-testing procedure. RESULTS Experimental results on transcriptome datasets show that all safe paths for MFDs correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27 000 non-trivial graphs of this dataset in only 1.5 h. Our results suggest that, on perfect data, there is less ambiguity than previously thought in the notoriously hard RNA assembly problem. AVAILABILITY AND IMPLEMENTATION https://github.com/algbio/mfd-safety.
Affiliation(s)
- Fernando H C Dias
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
- Manuel Cáceres
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
- Lucia Williams
  - School of Computing, Montana State University, Bozeman, MT 59717, United States
- Brendan Mumey
  - School of Computing, Montana State University, Bozeman, MT 59717, United States
- Alexandru I Tomescu
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
8. Yi H, Lin Y, Chang Q, Jin W. A fast and globally optimal solution for RNA-seq quantification. Brief Bioinform 2023; 24:bbad298. [PMID: 37595963 DOI: 10.1093/bib/bbad298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2023] [Revised: 07/25/2023] [Accepted: 07/31/2023] [Indexed: 08/20/2023] Open
Abstract
Alignment-based RNA-seq quantification methods typically involve a time-consuming alignment process prior to estimating transcript abundances. In contrast, alignment-free RNA-seq quantification methods bypass this step, resulting in significant speed improvements. Existing alignment-free methods rely on the Expectation-Maximization (EM) algorithm for estimating transcript abundances. However, EM algorithms only guarantee locally optimal solutions, leaving room for further accuracy improvement by finding a globally optimal solution. In this study, we present TQSLE, the first alignment-free RNA-seq quantification method that provides a globally optimal solution for transcript abundance estimation. TQSLE adopts a two-step approach: first, it constructs a k-mer frequency matrix $A$ for the reference transcriptome and a k-mer frequency vector $b$ for the RNA-seq reads; then, it directly estimates transcript abundances by solving the linear equation $A^{T}Ax = A^{T}b$. We evaluated the performance of TQSLE using simulated and real RNA-seq data sets and observed that, despite comparable speed to other alignment-free methods, TQSLE outperforms them in terms of accuracy. TQSLE is freely available at https://github.com/yhg926/TQSLE.
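The two steps described in this abstract can be sketched on a toy scale: count k-mers per transcript to form the matrix A, count k-mers in the reads to form the vector b, then solve the normal equations A^T A x = A^T b. The following is a minimal pure-Python illustration under stated assumptions, not the TQSLE implementation: the function names are hypothetical, read k-mers absent from the reference are simply ignored, the dense Gauss-Jordan solver stands in for whatever solver the tool uses, and non-negativity of x is not enforced.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurring in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def solve_normal_equations(A, b):
    """Solve (A^T A) x = A^T b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A[0])
    # Augmented matrix [A^T A | A^T b].
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(A, b))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A^T A is singular; some transcripts are not distinguishable")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def estimate_abundances(transcripts, reads, k):
    """Rows of A are reference k-mers, columns are transcripts; b holds read k-mer counts."""
    per_transcript = [kmer_counts(t, k) for t in transcripts]
    kmers = sorted(set().union(*per_transcript))
    A = [[counts[km] for counts in per_transcript] for km in kmers]
    read_counts = Counter()
    for read in reads:
        read_counts.update(kmer_counts(read, k))
    b = [read_counts[km] for km in kmers]
    return solve_normal_equations(A, b)


# Toy example: three error-free copies of the first transcript, one of the second.
transcripts = ["ACGTAC", "ACGTTT"]
reads = ["ACGTAC"] * 3 + ["ACGTTT"]
abundances = estimate_abundances(transcripts, reads, k=3)
```

Because the least-squares objective is convex, any solution of these equations is a global minimizer, which is the property the abstract contrasts with EM. A real transcriptome gives a very large, sparse A, so a dense solver like this one would not scale.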
Affiliation(s)
- Huiguang Yi
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, 97 Buxin Rd, Shenzhen, 518000, Guangdong, China
  - School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Blvd, Shenzhen 518055, Guangdong, China
- Yanling Lin
  - School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Blvd, Shenzhen 518055, Guangdong, China
- Qing Chang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, 97 Buxin Rd, Shenzhen, 518000, Guangdong, China
- Wenfei Jin
  - School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Blvd, Shenzhen 518055, Guangdong, China
9. Li X, Shao M. On de novo Bridging Paired-end RNA-seq Data. ACM-BCB: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2023; 2023:41. [PMID: 38045531 PMCID: PMC10692976 DOI: 10.1145/3584371.3612987] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 12/05/2023]
Abstract
High-throughput short-read RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads, and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data provide sufficient information for accurate bridging, let alone whether efficient algorithms exist that determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of a reference genome (called reference-based bridging), but these algorithms are far from scaling to de novo bridging, as the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices, for a further speed-up. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
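To make the phrase "truncated Dijkstra's algorithm" concrete, here is a minimal sketch of one plausible reading: an ordinary Dijkstra search that refuses to expand any vertex farther than a fixed bound from the source, so the explored region stays small even in a graph with millions of vertices. The bound (for example, a maximum fragment length), the adjacency-list representation and the function name are assumptions for illustration; the abstract does not specify the authors' truncation rule, and the shortest-path-tree reuse it mentions is not shown.

```python
import heapq


def truncated_dijkstra(graph, source, max_dist):
    """Shortest distances from source, restricted to vertices within max_dist.

    graph: dict mapping a vertex to a list of (neighbor, non-negative weight) pairs.
    Returns a dict of reachable vertices and their distances.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd > max_dist:
                continue  # truncation: never record or expand beyond the bound
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical toy graph; weights could stand for the sequence length added per edge.
graph = {
    "a": [("b", 2), ("c", 5)],
    "b": [("c", 1), ("d", 10)],
    "c": [("d", 4)],
}
near = truncated_dijkstra(graph, "a", max_dist=6)
wider = truncated_dijkstra(graph, "a", max_dist=7)
```

With the bound set to 6 the search never reaches `d`, whose shortest distance is 7; raising the bound to 7 includes it.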
Affiliation(s)
- Xiang Li
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
- Mingfu Shao
  - Department of Computer Science and Engineering, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
10. Borozan L, Rojas Ringeling F, Kao SY, Nikonova E, Monteagudo-Mesas P, Matijević D, Spletter ML, Canzar S. Counting pseudoalignments to novel splicing events. Bioinformatics 2023; 39:btad419. [PMID: 37432342 PMCID: PMC10348833 DOI: 10.1093/bioinformatics/btad419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Revised: 04/21/2023] [Accepted: 07/10/2023] [Indexed: 07/12/2023] Open
Abstract
MOTIVATION Alternative splicing (AS) of introns from pre-mRNA produces diverse sets of transcripts across cell types and tissues, but is also dysregulated in many diseases. Alignment-free computational methods have greatly accelerated the quantification of mRNA transcripts from short RNA-seq reads, but they inherently rely on a catalog of known transcripts and might miss novel, disease-specific splicing events. By contrast, alignment of reads to the genome can effectively identify novel exonic segments and introns. Event-based methods then count how many reads align to predefined features. However, an alignment is more expensive to compute and constitutes a bottleneck in many AS analysis methods. RESULTS Here, we propose fortuna, a method that guesses novel combinations of annotated splice sites to create transcript fragments. It then pseudoaligns reads to fragments using kallisto and efficiently derives counts of the most elementary splicing units from kallisto's equivalence classes. These counts can be directly used for AS analysis or summarized to larger units as used by other widely applied methods. In experiments on synthetic and real data, fortuna was around 7× faster than traditional align and count approaches, and was able to analyze almost 300 million reads in just 15 min when using four threads. It mapped reads containing mismatches more accurately across novel junctions and found more reads supporting aberrant splicing events in patients with autism spectrum disorder than existing methods. We further used fortuna to identify novel, tissue-specific splicing events in Drosophila. AVAILABILITY AND IMPLEMENTATION fortuna source code is available at https://github.com/canzarlab/fortuna.
Affiliation(s)
- Luka Borozan
  - Department of Mathematics, Josip Juraj Strossmayer University of Osijek, Osijek 31000, Croatia
- Francisca Rojas Ringeling
  - Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
- Shao-Yen Kao
  - Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany
- Elena Nikonova
  - Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany
- Domagoj Matijević
  - Department of Mathematics, Josip Juraj Strossmayer University of Osijek, Osijek 31000, Croatia
- Maria L Spletter
  - Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany
  - School of Science and Engineering, Division of Biological & Biomedical Systems, University of Missouri Kansas City, Kansas City, MO 64110, United States
- Stefan Canzar
  - Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
11. Brooks TG, Lahens NF, Mrčela A, Sarantopoulou D, Nayak S, Naik A, Sengupta S, Choi PS, Grant GR. BEERS2: RNA-Seq simulation through high fidelity in silico modeling. bioRxiv 2023:2023.04.21.537847. [PMID: 37162982 PMCID: PMC10168222 DOI: 10.1101/2023.04.21.537847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Simulation of RNA-seq reads is critical in the assessment, comparison, benchmarking, and development of bioinformatics tools. Yet the field of RNA-seq simulators has progressed little in the last decade. To address this need we have developed BEERS2, which combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline. BEERS2 takes input transcripts (typically full-length mRNA transcripts with polyA tails) from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts in FASTQ, SAM, or BAM format, with the SAM and BAM output containing the true alignment to the reference genome. It also produces true transcript-level quantification values. The simulation is designed to include the effects of polyA selection and RiboZero for ribosomal depletion, hexamer priming sequence biases, GC-content biases in PCR amplification, barcode read errors, and errors during PCR amplification. These characteristics combine to make BEERS2 the most complete simulation of RNA-seq to date. Finally, we demonstrate the use of BEERS2 by measuring the effect of several settings on the popular Salmon pseudoalignment algorithm.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Dimitra Sarantopoulou
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Current address: National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Soumyashant Nayak
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Current address: Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
- Amruta Naik
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Shaon Sengupta
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Peter S Choi
  - Division of Cancer Pathobiology, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
12. Imada EL, Wilks C, Langmead B, Marchionni L. REPAC: analysis of alternative polyadenylation from RNA-sequencing data. Genome Biol 2023; 24:22. [PMID: 36759904 PMCID: PMC9912678 DOI: 10.1186/s13059-023-02865-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Accepted: 01/24/2023] [Indexed: 02/11/2023] Open
Abstract
Alternative polyadenylation (APA) is an important post-transcriptional mechanism that has major implications in biological processes and diseases. Although specialized sequencing methods for polyadenylation exist, the availability of these data is limited compared to RNA-sequencing data. We developed REPAC, a framework for the analysis of APA from RNA-sequencing data. Using REPAC, we investigate the landscape of APA caused by activation of B cells. We also show that REPAC is faster than alternative methods by at least 7-fold and that it scales well to hundreds of samples. Overall, the REPAC method offers an accurate, easy, and convenient solution for the exploration of APA.
Affiliation(s)
- Eddie L. Imada
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, USA
- Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Luigi Marchionni
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, USA
13
Williams L, Tomescu AI, Mumey B. Flow Decomposition With Subpath Constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:360-370. [PMID: 35104222 DOI: 10.1109/tcbb.2022.3147697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.
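The decomposition task described in this abstract can be illustrated with a minimal sketch. The code below is not the authors' heuristic or FPT algorithm, and it ignores subpath constraints. It assumes the commonly used greedy-width strategy (repeatedly remove the source-to-sink path with the largest bottleneck) on a small directed acyclic graph. The function names, the toy network, and its flow values are hypothetical.

```python
def widest_path(adj, residual, source, sink):
    """Return (bottleneck, path) for the widest source-to-sink path in a DAG."""
    memo = {}

    def best(u):
        if u == sink:
            return (float("inf"), [sink])
        if u in memo:
            return memo[u]
        result = (0, None)
        for v in adj.get(u, []):
            width, tail = best(v)
            width = min(width, residual[(u, v)])
            if tail is not None and width > result[0]:
                result = (width, [u] + tail)
        memo[u] = result
        return result

    return best(source)


def greedy_width(flow, source, sink):
    """Decompose a DAG flow into weighted paths with the greedy-width heuristic.

    flow maps (u, v) edges to positive values and is assumed to satisfy
    flow conservation at every node other than source and sink.
    """
    residual = dict(flow)  # work on a copy; the input is left unchanged
    paths = []
    while True:
        # Only edges that still carry flow are usable.
        adj = {}
        for (u, v), f in residual.items():
            if f > 0:
                adj.setdefault(u, []).append(v)
        width, path = widest_path(adj, residual, source, sink)
        if path is None:
            break
        for edge in zip(path, path[1:]):
            residual[edge] -= width
        paths.append((path, width))
    return paths


# Hypothetical toy network: total flow of 8 from "s" to "t".
toy_flow = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 3, ("b", "t"): 6}
for path, weight in greedy_width(toy_flow, "s", "t"):
    print(weight, "->".join(path))
```

Greedy-width always returns a valid decomposition, but not necessarily a minimum one and not necessarily the paths that generated the flow, which is the gap that additional constraints such as subpaths aim to narrow.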
14
Khan S, Kortelainen M, Cáceres M, Williams L, Tomescu AI. Improving RNA Assembly via Safety and Completeness in Flow Decompositions. J Comput Biol 2022; 29:1270-1287. [PMID: 36288562 PMCID: PMC9807076 DOI: 10.1089/cmb.2022.0261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023]
Abstract
Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking and transportation planning to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50% more) compared with the other safe algorithms. On the other hand, although the greedy-width algorithm reports better coverage, it also reports significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms greedy-width (by ≈20%) on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has superior time (4-5×) and space (1.2-2.2×) performance, resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
Affiliation(s)
- Shahbaz Khan
- Department of Computer Science and Engineering, IIT Roorkee, Roorkee, India
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Address correspondence to: Prof. Shahbaz Khan, Department of Computer Science and Engineering, IIT Roorkee, Haridwar Highway, Roorkee 247667, Uttarakhand, India
- Milla Kortelainen
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Manuel Cáceres
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Lucia Williams
- School of Computing, Montana State University, Bozeman, Montana, USA
15
Fahmi NA, Ahmed KT, Chang JW, Nassereddeen H, Fan D, Yong J, Zhang W. APA-Scan: detection and visualization of 3'-UTR alternative polyadenylation with RNA-seq and 3'-end-seq data. BMC Bioinformatics 2022; 23:396. [PMID: 36171568 PMCID: PMC9520800 DOI: 10.1186/s12859-022-04939-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/26/2022]
Abstract
Background The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3′-untranslated region (3′-UTR) of mRNA produces transcripts with shorter or longer 3′-UTRs. Often, the 3′-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3′-UTR APA is known to modulate translation and provides a means to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3′-UTR APA events due to incomplete annotations and low analytical resolution: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but infer 3′-UTR APA using only RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3′-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations.
Methods APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3′-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3′-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events between two biological conditions; (iii) produce a graphical representation of a user-specified event with 3′-UTR annotation and read coverage on the 3′-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user’s manual are freely available at https://github.com/compbiolabucf/APA-Scan. Results APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines, DaPars and APAtrap. In simulation, APA-Scan significantly improved the accuracy of 3′-UTR APA identification compared to the other baselines. The performance of APA-Scan was also validated by 3′-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3′-UTR APA events and improve genome annotation. Conclusion APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3′-UTR APA events. The pipeline integrates both RNA-seq and 3′-end-seq data information and can efficiently identify the significant events with high-resolution short-read coverage plots. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04939-w.
Affiliation(s)
- Naima Ahmed Fahmi
- Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
- Khandakar Tanvir Ahmed
- Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
- Jae-Woong Chang
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, 420 Washington Ave. S.E., Minneapolis, MN, 55455, USA
- Heba Nassereddeen
- Department of Computer Engineering, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
- Deliang Fan
- School of Electrical, Computer and Energy Engineering, Arizona State University, 650 E Tyler Mall, Tempe, AZ, 85287, USA
- Jeongsik Yong
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, 420 Washington Ave. S.E., Minneapolis, MN, 55455, USA
- Wei Zhang
- Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
16
Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.06.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
17
Ruhela V, Gupta A, Sriram K, Ahuja G, Kaur G, Gupta R. A Unified Computational Framework for a Robust, Reliable, and Reproducible Identification of Novel miRNAs From the RNA Sequencing Data. Frontiers in Bioinformatics 2022; 2:842051. [PMID: 36304305 PMCID: PMC9580950 DOI: 10.3389/fbinf.2022.842051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2021] [Accepted: 06/02/2022] [Indexed: 11/13/2022]
Abstract
In eukaryotic cells, miRNAs regulate a plethora of cellular functions, ranging from cellular metabolism and development to the regulation of biological networks and pathways, both under homeostatic and pathological states like cancer. Despite their immense importance as key regulators of cellular processes, accurate and reliable estimation of miRNAs using Next Generation Sequencing is challenging, largely due to the limited availability of robust computational tools/methods/pipelines. Here, we introduce miRPipe, an end-to-end computational framework for the identification, characterization, and expression estimation of small RNAs, including the known and novel miRNAs and previously annotated piRNAs from small-RNA sequencing profiles. Our workflow detects unique novel miRNAs by incorporating the sequence information of seed and non-seed regions, concomitant with clustering analysis. This approach allows reliable and reproducible detection of unique novel miRNAs and functionally equivalent miRNAs (paralogues). We validated the performance of miRPipe against the available state-of-the-art pipelines using both synthetic datasets generated using the newly developed miRSim tool and three cancer datasets (chronic lymphocytic leukemia, lung cancer, and breast cancer). In the experiment on the synthetic dataset, miRPipe is observed to outperform the existing state-of-the-art pipelines (accuracy: 95.23% and F1-score: 94.17%). Analysis of all three cancer datasets shows that miRPipe is able to extract a greater number of known dysregulated miRNAs or piRNAs from the datasets than the existing pipelines.
Affiliation(s)
- Vivek Ruhela
- Department of Computational Biology & Centre for Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-D), New Delhi, India
- *Correspondence: Vivek Ruhela; Anubha Gupta; Ritu Gupta
- Anubha Gupta
- SBILab, Department of ECE & Centre of Excellence in Healthcare, Indraprastha Institute of Information Technology-Delhi (IIIT-D), New Delhi, India
- K. Sriram
- Department of Computational Biology & Centre for Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-D), New Delhi, India
- Gaurav Ahuja
- Department of Computational Biology & Centre for Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-D), New Delhi, India
- Gurvinder Kaur
- Laboratory Oncology Unit, IRCH, All India Institute of Medical Sciences (AIIMS), New Delhi, India
- Ritu Gupta
- Laboratory Oncology Unit, IRCH, All India Institute of Medical Sciences (AIIMS), New Delhi, India
18
Privitera GF, Alaimo S, Ferro A, Pulvirenti A. Virus finding tools: current solutions and limitations. Brief Bioinform 2022; 23:6618234. [PMID: 35753694 DOI: 10.1093/bib/bbac235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Revised: 05/02/2022] [Accepted: 05/20/2022] [Indexed: 11/13/2022]
Abstract
MOTIVATION The study of the Human Virome remains challenging nowadays. Viral metagenomics, through high-throughput sequencing data, is the best choice for virus discovery. The metagenomics approach is culture-independent and sequence-independent, helping search for either known or novel viruses. Though it is estimated that more than 40% of the viruses found in metagenomics analysis are not recognizable, we decided to analyze several tools to identify and discover viruses in RNA-seq samples. RESULTS We have analyzed eight Virus Tools for the identification of viruses in RNA-seq data. These tools were compared using a synthetic dataset of 30 viruses and a real one. Our analysis shows that no tool succeeds in recognizing all the viruses in the datasets. So we can conclude that each of these tools has pros and cons, and their choice depends on the application domain. AVAILABILITY Synthetic data used through the review and raw results of their analysis can be found at https://zenodo.org/record/6426147. FASTQ files of real data can be found in GEO (https://www.ncbi.nlm.nih.gov/gds) or ENA (https://www.ebi.ac.uk/ena/browser/home). Raw results of their analysis can be downloaded from https://zenodo.org/record/6425917.
Affiliation(s)
- Grete Francesca Privitera
- Department of Physics and Astronomy, University of Catania, Viale A. Doria, 6, 95125, Catania, Italy
- Salvatore Alaimo
- Department of Clinical and Experimental Medicine, University of Catania, c/o Dept. of Math. and Comp. Science Viale A. Doria, 6, 95125, Catania, Italy
- Alfredo Ferro
- Department of Clinical and Experimental Medicine, University of Catania, c/o Dept. of Math. and Comp. Science Viale A. Doria, 6, 95125, Catania, Italy
- Alfredo Pulvirenti
- Department of Clinical and Experimental Medicine, University of Catania, c/o Dept. of Math. and Comp. Science Viale A. Doria, 6, 95125, Catania, Italy
19
Lee K, Yu D, Hyung D, Cho SY, Park C. ASpediaFI: Functional Interaction Analysis of Alternative Splicing Events. Genomics, Proteomics & Bioinformatics 2022; 20:466-482. [PMID: 35085775 PMCID: PMC9801047 DOI: 10.1016/j.gpb.2021.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2020] [Revised: 10/15/2021] [Accepted: 11/01/2021] [Indexed: 01/26/2023]
Abstract
Alternative splicing (AS) regulates biological processes governing phenotypes and diseases. Differential AS (DAS) gene test methods have been developed to investigate important exonic expression from high-throughput datasets. However, the DAS events extracted using statistical tests are insufficient to delineate relevant biological processes. In this study, we developed a novel application, Alternative Splicing Encyclopedia: Functional Interaction (ASpediaFI), to systemically identify DAS events and co-regulated genes and pathways. ASpediaFI establishes a heterogeneous interaction network of genes and their feature nodes (i.e., AS events and pathways) connected by co-expression or pathway gene set knowledge. Next, ASpediaFI explores the interaction network using the random walk with restart algorithm and interrogates the proximity from a query gene set. Finally, ASpediaFI extracts significant AS events, genes, and pathways. To evaluate the performance of our method, we simulated RNA sequencing (RNA-seq) datasets to consider various conditions of sequencing depth and sample size. The performance was compared with that of other methods. Additionally, we analyzed three public datasets of cancer patients or cell lines to evaluate how well ASpediaFI detects biologically relevant candidates. ASpediaFI exhibits strong performance in both simulated and public datasets. Our integrative approach reveals that DAS events that recognize a global co-expression network and relevant pathways determine the functional importance of spliced genes in the subnetwork. ASpediaFI is publicly available at https://bioconductor.org/packages/ASpediaFI.
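The random walk with restart step mentioned in this abstract can be sketched as follows. This is a generic textbook formulation in plain Python, not ASpediaFI's implementation. The toy graph, the node names, and the restart probability of 0.3 are assumptions chosen for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its proximity to a set of seed (query) nodes.

    adjacency maps each node to a list of neighbours; every neighbour must
    also appear as a key. At each step the walker jumps back to a seed with
    probability `restart`, otherwise it moves to a uniformly chosen neighbour.
    """
    nodes = list(adjacency)
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                # Dangling node: send its walking mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * p0[n]
                continue
            share = (1 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical three-node chain: query gene A, then B, then C.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = random_walk_with_restart(toy_graph, seeds={"A"})
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Nodes closer to the query set receive higher stationary probability, which is the sense in which the walk "interrogates the proximity" of features to a query gene set.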
20
Shumate A, Wong B, Pertea G, Pertea M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput Biol 2022; 18:e1009730. [PMID: 35648784 PMCID: PMC9191730 DOI: 10.1371/journal.pcbi.1009730] [Citation(s) in RCA: 99] [Impact Index Per Article: 49.5] [Received: 12/08/2021] [Revised: 06/13/2022] [Accepted: 05/11/2022] [Indexed: 01/01/2023]
Abstract
Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.
Affiliation(s)
- Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Brandon Wong
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Applied Math and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
- Geo Pertea
- The Lieber Institute for Brain Development, Baltimore, Maryland, United States of America
- Mihaela Pertea
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
21
ASTool: An Easy-to-Use Tool to Accurately Identify Alternative Splicing Events from Plant RNA-Seq Data. Int J Mol Sci 2022; 23:ijms23084079. [PMID: 35456896 PMCID: PMC9031537 DOI: 10.3390/ijms23084079] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/14/2022] [Revised: 04/01/2022] [Accepted: 04/03/2022] [Indexed: 11/16/2022]
Abstract
Alternative splicing (AS) is an essential co-transcriptional regulatory mechanism in eukaryotes. The accumulation of plant RNA-Seq data provides an unprecedented opportunity to investigate the global landscape of plant AS events. However, most existing AS identification tools were originally designed for animals, and their performance in plants was not rigorously benchmarked. In this work, we developed a simple and easy-to-use bioinformatics tool named ASTool for detecting AS events from plant RNA-Seq data. As an exon-based method, ASTool can detect 4 major AS types, including intron retention (IR), exon skipping (ES), alternative 5′ splice sites (A5SS), and alternative 3′ splice sites (A3SS). Compared with existing tools, ASTool revealed a favorable performance when tested in simulated RNA-Seq data, with both recall and precision values exceeding 95% in most cases. Moreover, ASTool also showed a competitive computational speed and consistent detection results with existing tools when tested in simulated or real plant RNA-Seq data. Considering that IR is the most predominant AS type in plants, ASTool allowed the detection and visualization of novel IR events based on known splice sites. To fully present the functionality of ASTool, we also provided an application example of ASTool in processing real RNA-Seq data of Arabidopsis in response to heat stress.
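Recall and precision, the metrics reported in this abstract, can be computed by comparing predicted events against a simulated ground truth. The sketch below is not part of ASTool. The event tuples (type, gene, start, end), the gene names, and the function names are hypothetical.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two collections of hashable events."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical simulated truth and tool output.
truth_events = {
    ("IR", "geneA", 100, 200),
    ("ES", "geneB", 300, 400),
    ("A5SS", "geneC", 50, 80),
}
predicted_events = {
    ("IR", "geneA", 100, 200),
    ("ES", "geneB", 300, 400),
    ("A3SS", "geneD", 10, 40),  # not in the truth set: a false positive
}
p, r = precision_recall(predicted_events, truth_events)
print(f"precision={p:.2f} recall={r:.2f} F1={f_score(p, r):.2f}")
```

Exact matching of event coordinates is a simplifying assumption; a real benchmark would need to define when two reported events count as the same event.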
22
Grealey J, Lannelongue L, Saw WY, Marten J, Méric G, Ruiz-Carmona S, Inouye M. The Carbon Footprint of Bioinformatics. Mol Biol Evol 2022; 39:6526403. [PMID: 35143670 PMCID: PMC8892942 DOI: 10.1093/molbev/msac034] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Indexed: 11/14/2022]
Abstract
Bioinformatic research relies on large-scale computational infrastructures which have a nonzero carbon footprint but so far, no study has quantified the environmental costs of bioinformatic tools and commonly run analyses. In this work, we estimate the carbon footprint of bioinformatics (in kilograms of CO2 equivalent units, kgCO2e) using the freely available Green Algorithms calculator (www.green-algorithms.org, last accessed 2022). We assessed 1) bioinformatic approaches in genome-wide association studies (GWAS), RNA sequencing, genome assembly, metagenomics, phylogenetics, and molecular simulations, as well as 2) computation strategies, such as parallelization, CPU (central processing unit) versus GPU (graphics processing unit), cloud versus local computing infrastructure, and geography. In particular, we found that biobank-scale GWAS emitted substantial kgCO2e and simple software upgrades could make it greener, for example, upgrading from BOLT-LMM v1 to v2.3 reduced carbon footprint by 73%. Moreover, switching from the average data center to a more efficient one can reduce carbon footprint by approximately 34%. Memory over-allocation can also be a substantial contributor to an algorithm’s greenhouse gas emissions. The use of faster processors or greater parallelization reduces running time but can lead to greater carbon footprint. Finally, we provide guidance on how researchers can reduce power consumption and minimize kgCO2e. Overall, this work elucidates the carbon footprint of common analyses in bioinformatics and provides solutions which empower a move toward greener research.
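The arithmetic behind such estimates can be sketched in a few lines. The code below is not the Green Algorithms calculator and does not reproduce its model. It assumes the generic relation in which energy is runtime times power draw times the data centre's power usage effectiveness (PUE), and emissions are energy times the carbon intensity of the electricity. Every number in it (runtime, power draw, PUE values, carbon intensity) is a hypothetical input chosen for illustration, not a result from the paper.

```python
def carbon_footprint_kg(runtime_h, power_w, pue, carbon_intensity_kg_per_kwh):
    """Estimate emissions in kgCO2e under the simple model described above."""
    energy_kwh = runtime_h * power_w * pue / 1000.0
    return energy_kwh * carbon_intensity_kg_per_kwh


def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before


# Hypothetical job: 10 hours at 200 W on a grid emitting 0.5 kgCO2e per kWh.
average_centre = carbon_footprint_kg(10, 200, pue=1.67, carbon_intensity_kg_per_kwh=0.5)
efficient_centre = carbon_footprint_kg(10, 200, pue=1.10, carbon_intensity_kg_per_kwh=0.5)
print(f"average data centre:   {average_centre:.2f} kgCO2e")
print(f"efficient data centre: {efficient_centre:.2f} kgCO2e")
print(f"reduction: {relative_reduction(average_centre, efficient_centre):.0%}")
```

Because PUE and carbon intensity enter as plain multipliers in this simplified model, moving the same job to a more efficient facility or a cleaner grid scales the footprint proportionally, independent of the software being run.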
Affiliation(s)
- Jason Grealey
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Department of Mathematics and Statistics, La Trobe University, Melbourne, Australia
- Loïc Lannelongue
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Woei-Yuh Saw
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Jonathan Marten
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Guillaume Méric
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Sergio Ruiz-Carmona
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
- The Alan Turing Institute, London, UK
23
Zhao J, Feng H, Zhu D, Lin Y. MultiTrans: An Algorithm for Path Extraction Through Mixed Integer Linear Programming for Transcriptome Assembly. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:48-56. [PMID: 34033544 DOI: 10.1109/tcbb.2021.3083277] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Recent advances in RNA-seq technology have made identification of expressed genes affordable, thus boosting the rapid development of transcriptomic studies. Transcriptome assembly, reconstructing all expressed transcripts from RNA-seq reads, is an essential step to understand genes, proteins, and cell functions. Transcriptome assembly remains a challenging problem due to complications in splicing variants, expression levels, uneven coverage and sequencing errors. Here, we formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs), and propose a novel algorithm, MultiTrans, for path extraction using mixed integer linear programming. MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. We benchmarked MultiTrans against two state-of-the-art transcriptome assemblers, TransLiG and rnaSPAdes. Experimental results show that MultiTrans generates more accurate transcripts compared to TransLiG (using the same splicing graphs) and rnaSPAdes (using the same assembly graphs). MultiTrans is freely available at https://github.com/jzbio/MultiTrans.
24
Voshall A, Behera S, Li X, Yu XH, Kapil K, Deogun JS, Shanklin J, Cahoon EB, Moriyama EN. A consensus-based ensemble approach to improve transcriptome assembly. BMC Bioinformatics 2021; 22:513. [PMID: 34674629 PMCID: PMC8532302 DOI: 10.1186/s12859-021-04434-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/28/2020] [Accepted: 10/10/2021] [Indexed: 01/02/2023]
Abstract
BACKGROUND Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models and there is no easy way to determine which are more accurate. Furthermore, having alternative-splicing events exacerbates such difficult assembly problems. While benchmarking transcriptome assemblies is critical, this is also not trivial due to the general lack of true reference transcriptomes. RESULTS In this study, we first provide a pipeline to generate a simulated benchmark transcriptome and corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches including both de novo and genome-guided methods. The results showed that the assembly performance deteriorates significantly when alternative transcripts (isoforms) exist or, for genome-guided methods, when the reference is not available from the same genome. To improve the transcriptome assembly performance, leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble. CONCLUSIONS Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as any of the de novo assemblers we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. Furthermore, ConSemble using de novo assemblers matched or exceeded the best performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrated that the ConSemble consensus strategy both for de novo and genome-guided assemblers can improve transcriptome assembly. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from: http://bioinfolab.unl.edu/emlab/consemble/.
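The idea of leveraging overlapping predictions between assemblies can be illustrated with a simple voting scheme. The sketch below is not the ConSemble script. It assumes that a contig is kept when at least `min_support` assemblers report an identical sequence; the threshold of three, the exact-match comparison, and the toy sequences are hypothetical choices for illustration.

```python
from collections import Counter


def consensus_contigs(assemblies, min_support=3):
    """Keep sequences reported by at least `min_support` assemblies.

    assemblies is a list with one collection of contig sequences per assembler.
    """
    votes = Counter()
    for contigs in assemblies:
        # Each assembler votes at most once per distinct sequence.
        votes.update(set(contigs))
    return {seq for seq, count in votes.items() if count >= min_support}


# Hypothetical output of four assemblers on the same reads.
toy_assemblies = [
    ["ATGGCC", "TTGACA"],
    ["ATGGCC", "GGGTTT"],
    ["ATGGCC", "TTGACA"],
    ["ATGGCC", "TTGACA", "CCCAAA"],
]
print(sorted(consensus_contigs(toy_assemblies)))
```

Raising `min_support` trades recall for precision: contigs produced by a single assembler are discarded, which removes assembler-specific errors at the cost of losing transcripts that only one method recovers.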
Collapse
Affiliation(s)
- Adam Voshall
- School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.,Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.,Department of Pediatrics, Division of Genetics and Genomics, Boston Children's Hospital/Harvard Medical School, Boston, MA, 02115, USA
| | - Sairam Behera
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.,Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Xiangjun Li
- Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.,Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
| | - Xiao-Hong Yu
- Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY, 11794, USA
| | - Kushagra Kapil
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
| | - Jitender S Deogun
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
| | - John Shanklin
- Biology Department, Brookhaven National Laboratory, Upton, NY, 11973, USA
| | - Edgar B Cahoon
- Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA; Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
| | - Etsuko N Moriyama
- School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA; Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
| |
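The consensus idea described in the ConSemble abstract (keeping predictions that overlap between assemblies) can be illustrated with a minimal sketch. This is not the authors' script: the function name `consensus_contigs`, the exact-sequence matching, and the `min_support` threshold of three out of four assemblies are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def consensus_contigs(
    assemblies: Dict[str, Iterable[str]], min_support: int = 3
) -> Set[str]:
    """Return sequences reported by at least `min_support` assemblies.

    `min_support` is a hypothetical threshold chosen for illustration.
    """
    support: Counter = Counter()
    for contigs in assemblies.values():
        # Deduplicate within an assembly so each assembler votes once per sequence.
        support.update(set(contigs))
    return {seq for seq, votes in support.items() if votes >= min_support}


if __name__ == "__main__":
    # Toy data: four hypothetical assemblers and their predicted sequences.
    toy = {
        "assembler_a": ["ATGGCC", "ATGTTT", "ATGAAA"],
        "assembler_b": ["ATGGCC", "ATGTTT"],
        "assembler_c": ["ATGGCC", "ATGCCC"],
        "assembler_d": ["ATGGCC", "ATGTTT", "ATGCCC"],
    }
    print(sorted(consensus_contigs(toy)))
```

Raising `min_support` trades recall for precision: sequences predicted by only one assembler are the most likely to be mis-assemblies, which is the intuition behind the precision gains the abstract reports.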
|
25
|
Lahens NF, Brooks TG, Sarantopoulou D, Nayak S, Lawrence C, Mrčela A, Srinivasan A, Schug J, Hogenesch JB, Barash Y, Grant GR. CAMPAREE: a robust and configurable RNA expression simulator. BMC Genomics 2021; 22:692. [PMID: 34563123 PMCID: PMC8467241 DOI: 10.1186/s12864-021-07934-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/03/2021] [Accepted: 08/17/2021] [Indexed: 11/10/2022] Open
Abstract
Background The accurate interpretation of RNA-Seq data presents a moving target as scientists continue to introduce new experimental techniques and analysis algorithms. Simulated datasets are an invaluable tool to accurately assess the performance of RNA-Seq analysis methods. However, existing RNA-Seq simulators focus on modeling the technical biases and artifacts of sequencing, rather than on simulating the original RNA samples. A first step in simulating RNA-Seq is to simulate RNA. Results To fill this need, we developed the Configurable And Modular Program Allowing RNA Expression Emulation (CAMPAREE), a simulator using empirical data to simulate diploid RNA samples at the level of individual molecules. We demonstrated CAMPAREE’s use for generating idealized coverage plots from real data, and for adding the ability to generate allele-specific data to existing RNA-Seq simulators that do not natively support this feature. Conclusions Separating input sample modeling from library preparation/sequencing offers added flexibility for both users and developers to mix-and-match different sample and sequencing simulators to suit their specific needs. Furthermore, the ability to maintain sample and sequencing simulators independently provides greater agility to incorporate new biological findings about transcriptomics and new developments in sequencing technologies. Additionally, by simulating at the level of individual molecules, CAMPAREE has the potential to model molecules transcribed from the same genes as a heterogeneous population of transcripts with different states of degradation and processing (splicing, editing, etc.). CAMPAREE was developed in Python, is open source, and freely available at https://github.com/itmat/CAMPAREE. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07934-2.
Affiliation(s)
- Nicholas F Lahens
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Thomas G Brooks
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Dimitra Sarantopoulou
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Present address: National Institute on Aging, National Institutes of Health, Baltimore, Maryland, USA
| | - Soumyashant Nayak
- Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
| | - Cris Lawrence
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Antonijo Mrčela
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Anand Srinivasan
- Perelman School of Medicine, Enterprise Research Applications and High Performance Computing, Penn Medicine Academic Computing Services, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Jonathan Schug
- The Institute for Diabetes, Obesity and Metabolism, The Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - John B Hogenesch
- Division of Human Genetics, Department of Pediatrics, Center for Chronobiology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
| | - Yoseph Barash
- The Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Gregory R Grant
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; The Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| |
|
26
|
Shi X, Wang X, Neuwald AF, Hilakivi-Clarke L, Clarke R, Xuan J. A Bayesian approach for accurate de novo transcriptome assembly. Sci Rep 2021; 11:17663. [PMID: 34480063 PMCID: PMC8417280 DOI: 10.1038/s41598-021-97015-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2020] [Accepted: 05/17/2021] [Indexed: 11/09/2022] Open
Abstract
De novo transcriptome assembly from billions of RNA-seq reads is very challenging due to alternative splicing and various levels of expression, which often leads to incorrect, mis-assembled transcripts. BayesDenovo addresses this problem by using both a read-guided strategy to accurately reconstruct splicing graphs from the RNA-seq data and a Bayesian strategy to estimate, from these graphs, the probability of transcript expression without penalizing poorly expressed transcripts. Simulation and cell line benchmark studies demonstrate that BayesDenovo is very effective in reducing false positives and achieves much higher accuracy than other assemblers, especially for alternatively spliced genes and for highly or poorly expressed transcripts. Moreover, BayesDenovo is more robust on multiple replicates by assembling a larger portion of common transcripts. When applied to breast cancer data, BayesDenovo identifies phenotype-specific transcripts associated with breast cancer recurrence.
Affiliation(s)
- Xu Shi
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA, 22203, USA
| | - Xiao Wang
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA, 22203, USA
| | - Andrew F Neuwald
- Institute for Genome Sciences and Department Biochemistry and Molecular Biology, University of Maryland School of Medicine, 670 W. Baltimore Street, Baltimore, MD, 21201, USA
| | | | - Robert Clarke
- Hormel Institute, University of Minnesota, 16th Street N, Austin, MN, 55912, USA
| | - Jianhua Xuan
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA, 22203, USA.
| |
|
27
|
Li M, Bai M, Wu Y, Shao W, Zheng L, Sun L, Wang S, Yu C, Huang Y. AGTAR: A novel approach for transcriptome assembly and abundance estimation using an adapted genetic algorithm from RNA-seq data. Comput Biol Med 2021; 135:104646. [PMID: 34274894 DOI: 10.1016/j.compbiomed.2021.104646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2021] [Revised: 06/20/2021] [Accepted: 07/07/2021] [Indexed: 11/25/2022]
Abstract
BACKGROUND Recently, the rapid development of RNA-seq technologies has accelerated transcriptomics research. The accurate identification and quantification of transcripts based on RNA-seq data will facilitate the exploration of various potential biological mechanisms. However, due to the limitations of the current data analysis tools and RNA-seq technologies, full and accurate reconstruction of the transcriptome still faces many challenges. RESULTS We developed the adapted genetic algorithm (AGTAR) program, which can reliably assemble transcriptomes and estimate abundance based on RNA-seq data with or without genome annotation files. We defined a new concept, isoform junction abundance, to help enhance the accuracy of isoform identification and quantification. Isoform abundance and isoform junction abundance are estimated by an adapted genetic algorithm. The crossover and mutation probabilities of the algorithm can be adaptively adjusted to effectively prevent premature convergence. Both simulated and real data indicated that AGTAR's comprehensive ability to assemble transcripts is significantly superior to that achievable by the currently widely used tools with similar functions. CONCLUSIONS AGTAR is a tool for identifying and quantifying transcripts from RNA-seq data. It has the advantages of higher accuracy and ease of use. The AGTAR package is freely available at https://github.com/v4yuezi/AGTAR.git.
Affiliation(s)
- Mingyue Li
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China
| | - Miao Bai
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China
| | - Yulun Wu
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China
| | - Wenjun Shao
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China
| | - Lihua Zheng
- Research Center of Agriculture and Medicine Gene Engineering of Ministry of Education, Northeast Normal University, Changchun, 130024, China
| | - Luguo Sun
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China
| | - Shuyue Wang
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China
| | - Chunlei Yu
- Research Center of Agriculture and Medicine Gene Engineering of Ministry of Education, Northeast Normal University, Changchun, 130024, China
| | - Yanxin Huang
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun, 130024, China.
| |
|
28
|
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan, preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that remains open. Here we propose CliqueSNV, which is based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables the identification of minority haplotypes with frequencies below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short- and long-read sequencing platforms.
Affiliation(s)
- Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA; Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA; Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
| | - Viachaslau Tsyvina
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Anupama Shankar
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Andrew Melnyk
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | | | - Tatiana Malygina
- International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
| | - Yuri B Porozov
- World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia; Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
| | - Ellsworth M Campbell
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - William M Switzer
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA; World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
| |
|
29
|
Grimes T, Datta S. SeqNet: An R Package for Generating Gene-Gene Networks and Simulating RNA-Seq Data. J Stat Softw 2021; 98:10.18637/jss.v098.i12. [PMID: 34321962 PMCID: PMC8315007 DOI: 10.18637/jss.v098.i12] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/12/2022] Open
Abstract
Gene expression data provide an abundant resource for inferring connections in gene regulatory networks. While methodologies developed for this task have shown success, a challenge remains in comparing the performance among methods. Gold-standard datasets are scarce and limited in use. And while tools for simulating expression data are available, they are not designed to resemble the data obtained from RNA-seq experiments. SeqNet is an R package that provides tools for generating a rich variety of gene network structures and simulating RNA-seq data from them. This produces in silico RNA-seq data for benchmarking and assessing gene network inference methods. The package is available on CRAN and on GitHub at https://github.com/tgrimes/SeqNet.
Affiliation(s)
- Tyler Grimes
- University of Florida, Department of Biostatistics
| | | |
|
30
|
Muller IB, Meijers S, Kampstra P, van Dijk S, van Elswijk M, Lin M, Wojtuszkiewicz AM, Jansen G, de Jonge R, Cloos J. Computational comparison of common event-based differential splicing tools: practical considerations for laboratory researchers. BMC Bioinformatics 2021; 22:347. [PMID: 34174808 PMCID: PMC8236165 DOI: 10.1186/s12859-021-04263-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/17/2020] [Accepted: 06/11/2021] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Computational tools analyzing RNA-sequencing data have boosted alternative splicing research by identifying and assessing differentially spliced genes. However, common alternative splicing analysis tools differ substantially in their statistical analyses and general performance. This report compares the computational performance (CPU utilization and RAM usage) of three event-level splicing tools: rMATS, MISO, and SUPPA2. Additionally, concordance between tool outputs was investigated. RESULTS Log-linear relations were found between job times and dataset size in all splicing tools and all virtual machine (VM) configurations. MISO had the highest job times for all analyses, irrespective of VM size, and MISO analyses also exceeded maximum CPU utilization on all VM sizes. rMATS and SUPPA2 load averages were relatively low in both size and replicate comparisons, not nearing maximum CPU utilization in the VM simulating the lowest computational power (D2 VM). RAM usage in rMATS and SUPPA2 did not exceed 20% of maximum RAM in either size or replicate comparisons, while MISO reached maximum RAM usage in D2 VM analyses for input size. Correlation coefficients of differential splicing analyses showed high correlation (β > 80%) between different tool outputs with the exception of comparisons of retained intron (RI) events between rMATS/MISO and rMATS/SUPPA2 (β < 60%). CONCLUSIONS Prior to RNA-seq analyses, users should consider job time, number of replicates and splice event type of interest to determine the optimal alternative splicing tool. In general, rMATS is superior to both MISO and SUPPA2 in computational performance. Analysis outputs show high concordance between tools, with the exception of RI events.
Affiliation(s)
- Ittai B Muller
- Department of Clinical Chemistry, Amsterdam UMC - location VUmc, Amsterdam, The Netherlands
| | | | | | | | | | - Marry Lin
- Department of Clinical Chemistry, Amsterdam UMC - location VUmc, Amsterdam, The Netherlands
| | - Anna M Wojtuszkiewicz
- Department of Hematology, Cancer Center Amsterdam, Rm CCA 4.24, Amsterdam UMC - location VUmc, De Boelelaan 1117, 1081 HV, Amsterdam, The Netherlands
| | - Gerrit Jansen
- Amsterdam Rheumatology and immunology Center, Amsterdam UMC - location VUmc, Amsterdam, The Netherlands
| | - Robert de Jonge
- Department of Clinical Chemistry, Amsterdam UMC - location VUmc, Amsterdam, The Netherlands
| | - Jacqueline Cloos
- Department of Hematology, Cancer Center Amsterdam, Rm CCA 4.24, Amsterdam UMC - location VUmc, De Boelelaan 1117, 1081 HV, Amsterdam, The Netherlands.
| |
|
31
|
Denti L, Pirola Y, Previtali M, Ceccato T, Della Vedova G, Rizzi R, Bonizzoni P. Shark: fishing relevant reads in an RNA-Seq sample. Bioinformatics 2021; 37:464-472. [PMID: 32926128 PMCID: PMC8088329 DOI: 10.1093/bioinformatics/btaa779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/01/2020] [Revised: 08/17/2020] [Accepted: 09/02/2020] [Indexed: 11/19/2022] Open
Abstract
Motivation Recent advances in high-throughput RNA-Seq technologies allow the production of massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. Results We introduce a novel computational problem, called gene assignment, and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists of extracting from the sample the reads that were most probably sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. Availability and implementation The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Luca Denti
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Yuri Pirola
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Marco Previtali
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Tamara Ceccato
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Gianluca Della Vedova
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Raffaella Rizzi
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| |
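The gene assignment problem stated in the Shark abstract (select the reads most likely sequenced from a panel of genes, without alignment) can be sketched with simple k-mer sharing. The abstract does not describe the mechanism, so everything here is an assumption for illustration: the k-mer approach itself, the names `kmers` and `assign_reads`, and the parameters `k` and `min_fraction`. It is not Shark's implementation or its command-line interface.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_reads(
    reads: Dict[str, str],
    genes: Dict[str, str],
    k: int = 5,
    min_fraction: float = 0.6,
) -> Dict[str, List[str]]:
    """Map each read ID to the genes sharing enough of its k-mers.

    A read is assigned to a gene when at least `min_fraction` of the read's
    distinct k-mers occur in that gene. Both `k` and `min_fraction` are
    hypothetical defaults. Reads matching no gene are dropped.
    """
    gene_kmers = {name: kmers(seq, k) for name, seq in genes.items()}
    assignment: Dict[str, List[str]] = {}
    for read_id, read in reads.items():
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to compare
        hits = [
            name
            for name, gk in gene_kmers.items()
            if len(read_kmers & gk) / len(read_kmers) >= min_fraction
        ]
        if hits:
            assignment[read_id] = hits
    return assignment


if __name__ == "__main__":
    genes = {"g1": "ACGTACGTTAGC", "g2": "TTTTGGGGCCCCAA"}
    reads = {"r1": "GTACGTTA", "r2": "GGGGCCCC", "r3": "ATATATATAT"}
    print(assign_reads(reads, genes))
```

Filtering reads this way before running a downstream pipeline is what produces the speed-up the abstract describes: only the reads retained in the returned mapping need to be processed further.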
|
32
|
Shi X, Neuwald AF, Wang X, Wang TL, Hilakivi-Clarke L, Clarke R, Xuan J. IntAPT: integrated assembly of phenotype-specific transcripts from multiple RNA-seq profiles. Bioinformatics 2021; 37:650-658. [PMID: 33016988 PMCID: PMC8097681 DOI: 10.1093/bioinformatics/btaa852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2019] [Revised: 08/27/2020] [Accepted: 09/21/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION High-throughput RNA sequencing has revolutionized the scope and depth of transcriptome analysis. Accurate reconstruction of a phenotype-specific transcriptome is challenging due to the noise and variability of RNA-seq data. This requires computational identification of transcripts from multiple samples of the same phenotype, given the underlying consensus transcript structure. RESULTS We present a Bayesian method, integrated assembly of phenotype-specific transcripts (IntAPT), that identifies phenotype-specific isoforms from multiple RNA-seq profiles. IntAPT features a novel two-layer Bayesian model to capture the presence of isoforms at the group layer and to quantify the abundance of isoforms at the sample layer. A spike-and-slab prior is used to model the isoform expression and to enforce the sparsity of expressed isoforms. Dependencies between the existence of isoforms and their expression are modeled explicitly to facilitate parameter estimation. Model parameters are estimated iteratively using Gibbs sampling to infer the joint posterior distribution, from which the presence and abundance of isoforms can reliably be determined. Studies using both simulations and real datasets show that IntAPT consistently outperforms existing methods for phenotype-specific transcript assembly. Experimental results demonstrate that, despite sequencing errors, IntAPT exhibits robust performance across multiple samples, resulting in notably improved identification of expressed isoforms of low abundance. AVAILABILITY AND IMPLEMENTATION The IntAPT package is available at http://github.com/henryxushi/IntAPT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xu Shi
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Andrew F Neuwald
- Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Xiao Wang
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
| | - Tian-Li Wang
- Department of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD 21231, USA
| | | | - Robert Clarke
- Hormel Institute, University of Minnesota, 801 16th Ave NE, Austin, MN 55912, USA
| | - Jianhua Xuan
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
| |
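For readers unfamiliar with the spike-and-slab prior mentioned in the IntAPT abstract, its generic textbook form is given below. The symbols ($\gamma_j$ for the presence indicator of isoform $j$, $\theta_j$ for its expression, $\pi$ for the prior inclusion probability, $p_{\text{slab}}$ for the slab density) are assumptions chosen for illustration and are not taken from the paper, whose two-layer model may differ in its details.

```latex
\begin{align}
  \gamma_j &\sim \mathrm{Bernoulli}(\pi), \\
  \theta_j \mid \gamma_j &\sim (1 - \gamma_j)\,\delta_0(\theta_j)
      + \gamma_j\, p_{\text{slab}}(\theta_j).
\end{align}
```

Here $\delta_0$ is a point mass at zero (the spike), so an isoform with $\gamma_j = 0$ has exactly zero expression, which is how such a prior enforces sparsity among candidate isoforms.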
|
33
|
Chorlton SD. Reanalysis of Alzheimer's brain sequencing data reveals absence of purported HHV6A and HHV7. J Bioinform Comput Biol 2021; 18:2050012. [PMID: 32336252 DOI: 10.1142/s0219720020500122] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 11/18/2022]
Abstract
Readhead et al. recently reported in Neuron the detection and association of human herpesviruses 6A (HHV6A) and 7 (HHV7) with Alzheimer's disease by shotgun sequencing. I was skeptical of the specificity of their modified Viromescan bioinformatics method and subsequent analysis for numerous reasons. Using their supplementary data, the prevalence of variola virus, the etiological agent of the eradicated disease smallpox, can be calculated at 97.5% of their Mount Sinai Brain Bank dataset. Reanalysis of Readhead et al.'s data using highly sensitive and specific alternative methods finds no HHV7 reads in their samples; HHV6A reads were found in only 2 out of their top 15 samples sorted by reported HHV6A abundance. Finally, recreation of Readhead et al.'s modified Viromescan method identifies reasons for its low specificity.
Affiliation(s)
- Samuel D Chorlton
- Department of Pathology and Laboratory Medicine, University of British Columbia, Canada
| |
|
34
|
Davies P, Jones M, Liu J, Hebenstreit D. Anti-bias training for (sc)RNA-seq: experimental and computational approaches to improve precision. Brief Bioinform 2021; 22:6265204. [PMID: 33959753 PMCID: PMC8574610 DOI: 10.1093/bib/bbab148] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/08/2021] [Revised: 03/10/2021] [Accepted: 03/26/2021] [Indexed: 12/29/2022] Open
Abstract
RNA-seq, including single cell RNA-seq (scRNA-seq), is plagued by insufficient sensitivity and lack of precision. As a result, the full potential of (sc)RNA-seq is limited. A major factor in this respect is the presence of global bias in most datasets, which affects detection and quantitation of RNA in a length-dependent fashion. In particular, scRNA-seq is affected by technical noise and a high rate of dropouts, where the vast majority of original transcripts are not converted into sequencing reads. We discuss the origins and implications of these biases, bioinformatics approaches to correct for them, and how biases can be exploited to infer characteristics of the sample preparation process, which in turn can be used to improve library preparation.
Affiliation(s)
- Philip Davies
- Daniel Hebenstreit's Research Group University of Warwick, CV4 7AL Coventry, UK
| | - Matt Jones
- Daniel Hebenstreit's Research Group University of Warwick, CV4 7AL Coventry, UK
| | - Juntai Liu
- Physics Department, University of Warwick, CV4 7AL Coventry, UK
| | | |
|
35
|
Fahmi NA, Nassereddeen H, Chang J, Park M, Yeh H, Sun J, Fan D, Yong J, Zhang W. AS-Quant: Detection and Visualization of Alternative Splicing Events with RNA-seq Data. Int J Mol Sci 2021; 22:ijms22094468. [PMID: 33922891 PMCID: PMC8123109 DOI: 10.3390/ijms22094468] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/19/2021] [Revised: 04/19/2021] [Accepted: 04/22/2021] [Indexed: 02/06/2023] Open
Abstract
(1) Background: A simplistic understanding of the central dogma falls short in correlating the number of genes in the genome to the number of proteins in the proteome. Post-transcriptional alternative splicing contributes to the complexity of the proteome and is critical in understanding gene expression. mRNA-sequencing (RNA-seq) has been widely used to study the transcriptome and provides an opportunity to detect alternative splicing events among different biological conditions. Despite the popularity of studying transcriptome variants with RNA-seq, few efficient and user-friendly bioinformatics tools have been developed for the genome-wide detection and visualization of alternative splicing events. (2) Results: We propose AS-Quant (Alternative Splicing Quantitation), a robust program to identify alternative splicing events from RNA-seq data. We then extended AS-Quant to visualize the splicing events with short-read coverage plots along with complete gene annotation. The tool works in three major steps: (i) calculate the read coverage of the potential spliced exons and the corresponding gene; (ii) categorize the events into five different categories according to the annotation, and assess the significance of the events between two biological conditions; (iii) generate the short-read coverage plot for user-specified splicing events. Our extensive experiments on simulated and real datasets demonstrate that AS-Quant outperforms three other widely used baselines, SUPPA2, rMATS, and diffSplice, for detecting alternative splicing events. Moreover, the significant alternative splicing events identified by AS-Quant between two biological contexts were validated by RT-PCR experiments. (3) Availability: AS-Quant is implemented in Python 3.0. Source code and a comprehensive user's manual are freely available online.
Affiliation(s)
- Naima Ahmed Fahmi
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
| | - Heba Nassereddeen
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, USA
| | - Jaewoong Chang
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
| | - Meeyeon Park
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
| | - Hsinsung Yeh
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
| | - Jiao Sun
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
| | - Deliang Fan
- School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA
| | - Jeongsik Yong
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
- Correspondence: J.Y.; W.Z.
| | - Wei Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Correspondence: J.Y.; W.Z.
| |
|
36
|
Behera S, Voshall A, Moriyama EN. Plant Transcriptome Assembly: Review and Benchmarking. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
|
37
|
Estefania M, Andres R, Javier I, Marcelo Y, Ariel C. ASpli: Integrative analysis of splicing landscapes through RNA-Seq assays. Bioinformatics 2021; 37:2609-2616. [PMID: 33677494 DOI: 10.1093/bioinformatics/btab141] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 09/16/2020] [Revised: 01/26/2021] [Accepted: 02/27/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Genome-wide analysis of alternative splicing has been a very active field of research since the early days of Next Generation Sequencing technologies. Since then, ever-growing data availability and the development of increasingly sophisticated analysis methods have uncovered the complexity of the general splicing repertoire. A large number of splicing analysis methodologies exist, each of them presenting its own strengths and weaknesses. For instance, methods exclusively relying on junction information do not take advantage of the large majority of reads produced in an RNA-seq assay, isoform reconstruction methods might not detect novel intron retention events, some solutions can only handle canonical splicing events, and many existing methods can only perform pairwise comparisons. RESULTS In this contribution, we present ASpli, a computational suite implemented in the R statistical language, that allows the identification of changes in both annotated and novel alternative splicing events and can deal with simple, multi-factor or paired experimental designs. Our integrative computational workflow considers the same GLM model, applied to different sets of reads and junctions, in order to compute complementary splicing signals. Analyzing simulated and real data, we found that the consolidation of these signals resulted in a robust proxy of the occurrence of splicing alterations. While the analysis of junctions allowed us to uncover annotated as well as non-annotated events, read coverage signals notably increased recall capabilities at a very competitive performance when compared against other state-of-the-art splicing analysis algorithms. ASpli is freely available from the Bioconductor project site https://www.bioconductor.org/packages/ASpli. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | - Rabinovich Andres
- Fundacion Instituto Leloir, Buenos Aires, Argentina; Instituto de Investigaciones Bioquímicas de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
| | - Iserte Javier
- Fundacion Instituto Leloir, Buenos Aires, Argentina; Instituto de Investigaciones Bioquímicas de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
| | - Yanovsky Marcelo
- Fundacion Instituto Leloir, Buenos Aires, Argentina; Instituto de Investigaciones Bioquímicas de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
| | - Chernomoretz Ariel
- Fundacion Instituto Leloir, Buenos Aires, Argentina; Departamento de Fisica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Instituto de Fisica de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
| |
|
38
|
Ma Y, Liu S, Gao J, Chen C, Zhang X, Yuan H, Chen Z, Yin X, Sun C, Mao Y, Zhou F, Shao Y, Liu Q, Xu J, Cheng L, Yu D, Li P, Yi P, He J, Geng G, Guo Q, Si Y, Zhao H, Li H, Banes GL, Liu H, Nakamura Y, Kurita R, Huang Y, Wang X, Wang F, Fang G, Engel JD, Shi L, Zhang YE, Yu J. Genome-wide analysis of pseudogenes reveals HBBP1's human-specific essentiality in erythropoiesis and implication in β-thalassemia. Dev Cell 2021; 56:478-493.e11. [PMID: 33476555 DOI: 10.1016/j.devcel.2020.12.019] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 05/28/2020] [Revised: 11/16/2020] [Accepted: 12/28/2020] [Indexed: 02/05/2023]
Abstract
The human genome harbors 14,000 duplicated or retroposed pseudogenes. Given their functionality as regulatory RNAs and low conservation, we hypothesized that pseudogenes could shape human-specific phenotypes. To test this, we performed co-expression analyses and found that pseudogenes exhibited tissue-specific expression, especially in the bone marrow. By incorporating genetic data, we identified a bone-marrow-specific duplicated pseudogene, HBBP1 (η-globin), which has been implicated in β-thalassemia. Extensive functional assays demonstrated that HBBP1 is essential for erythropoiesis by binding the RNA-binding protein (RBP), HNRNPA1, to upregulate TAL1, a key regulator of erythropoiesis. The HBBP1/TAL1 interaction contributes to milder symptoms in β-thalassemia patients. Comparative studies further indicated that the HBBP1/TAL1 interaction is human-specific. Genome-wide analyses showed that duplicated pseudogenes are often bound by RBPs and less commonly bound by microRNAs compared with retropseudogenes. Taken together, we not only demonstrate that pseudogenes can drive human evolution but also provide insights on their functional landscapes.
Affiliation(s)
- Yanni Ma
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China.
| | - Siqi Liu
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Jie Gao
- State Key Laboratory of Experimental Hematology, National Clinical Research Center for Blood Diseases, Institute of Hematology & Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Tianjin 300020, China
| | - Chunyan Chen
- Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Xin Zhang
- Laboratory of Molecular Cardiology & Medical Molecular Imaging, First Affiliated Hospital of Shantou University Medical College, Shantou 515041, China
| | - Hao Yuan
- Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhongyang Chen
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Xiaolin Yin
- 923rd Hospital of the Joint Logistics Support Force of the Chinese People's Liberation Army, Guangxi 530021, China
| | - Chenguang Sun
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Yanan Mao
- Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Fanqi Zhou
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Yi Shao
- Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Qian Liu
- Shantou University Medical College, Shantou 515041, China
| | - Jiayue Xu
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Li Cheng
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China
| | - Daqi Yu
- Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Pingping Li
- 923rd Hospital of the Joint Logistics Support Force of the Chinese People's Liberation Army, Guangxi 530021, China
| | - Ping Yi
- Department of Obstetrics and Gynecology, the Third Affiliated Hospital of Chongqing Medical University (General Hospital), Chongqing 401120, China
| | - Jiahuan He
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Guangfeng Geng
- State Key Laboratory of Experimental Hematology, National Clinical Research Center for Blood Diseases, Institute of Hematology & Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Tianjin 300020, China
| | - Qing Guo
- State Key Laboratory of Experimental Hematology, National Clinical Research Center for Blood Diseases, Institute of Hematology & Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Tianjin 300020, China
| | - Yanmin Si
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Hualu Zhao
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Haipeng Li
- Chinese Academy of Sciences Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China; CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
| | - Graham L Banes
- Chinese Academy of Sciences Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China; Wisconsin National Primate Research Center, University of Wisconsin Madison, 1220 Capitol Court, Madison, WI 53715, USA
| | - He Liu
- Beijing Key Laboratory of Captive Wildlife Technology, Beijing Zoo, Beijing 100044, China
| | - Yukio Nakamura
- Cell Engineering Division, RIKEN BioResource Research Center, Ibaraki 305-0074, Japan
| | - Ryo Kurita
- Department of Research and Development, Central Blood Institute, Japanese Red Cross Society, Tokyo 105-8521, Japan
| | - Yue Huang
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China
| | - Xiaoshuang Wang
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Fang Wang
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China
| | - Gang Fang
- NYU Shanghai, 1555 Century Avenue, Shanghai 20012, China; Department of Biology, 1009 Silver Center, New York University, New York, NY 10003, USA; School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China
| | - James Douglas Engel
- Department of Cell and Developmental Biology, University of Michigan Medical School, Ann Arbor, MI 48109, USA
| | - Lihong Shi
- State Key Laboratory of Experimental Hematology, National Clinical Research Center for Blood Diseases, Institute of Hematology & Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Tianjin 300020, China.
| | - Yong E Zhang
- Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China; CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China; Chinese Institute for Brain Research, Beijing 102206, China.
| | - Jia Yu
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Science, Chinese Academy of Medical Sciences (CAMS) & School of Basic Medicine, Peking Union Medical College (PUMC), Beijing 100005, China; Key Laboratory of RNA and Hematopoietic Regulation, Chinese Academy of Medical Sciences, Beijing 100005, China; State Key Laboratory of Experimental Hematology, National Clinical Research Center for Blood Diseases, Institute of Hematology & Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Tianjin 300020, China.
| |
|
39
|
Varabyou A, Salzberg SL, Pertea M. Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments. Genome Res 2021; 31:301-308. [PMID: 33361112 PMCID: PMC7849408 DOI: 10.1101/gr.266213.120] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/21/2020] [Accepted: 12/18/2020] [Indexed: 12/25/2022]
Abstract
RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
| |
|
40
|
Chen L, Lang K, Mei Y, Shi Z, He K, Li F, Xiao H, Ye G, Han Z. FastD: Fast detection of insecticide target-site mutations and overexpressed detoxification genes in insect populations from RNA-Seq data. Ecol Evol 2020; 10:14346-14358. [PMID: 33391720 PMCID: PMC7771117 DOI: 10.1002/ece3.7037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2019] [Revised: 08/26/2020] [Accepted: 09/21/2020] [Indexed: 11/24/2022] Open
Abstract
Target-site mutations and detoxification gene overexpression are two major mechanisms conferring insecticide resistance. Molecular assays applied to detect these resistance genetic markers are time-consuming and have high false-positive rates. RNA-Seq data contains information on the variations within expressed genomic regions and expression of detoxification genes. However, there is no corresponding method to detect resistance markers at present. Here, we collected 66 reported resistance mutations of four insecticide targets (AChE, VGSC, RyR, and nAChR) from 82 insect species. Next, we obtained 403 sequences of the four target genes and 12,665 sequences of three kinds of detoxification genes including P450s, GSTs, and CCEs. Then, we developed a Perl program, FastD, to detect target-site mutations and overexpressed detoxification genes from RNA-Seq data and constructed a web server for FastD (http://www.insect-genome.com/fastd). Evaluation of FastD on simulated RNA-Seq data showed high sensitivity and specificity. We applied FastD to detect resistance markers in 15 populations of six insects, Plutella xylostella, Aphis gossypii, Anopheles arabiensis, Musca domestica, Leptinotarsa decemlineata and Apis mellifera. Results showed that 11 RyR mutations in P. xylostella, one nAChR mutation in A. gossypii, one VGSC mutation in A. arabiensis and five VGSC mutations in M. domestica had a frequency difference of >40% between resistant and susceptible populations, including the previously confirmed mutations G4946E in RyR, R81T in nAChR and L1014F in VGSC. In addition, 49 detoxification genes were found to be overexpressed in resistant populations compared with susceptible populations, including the previously confirmed detoxification genes CYP6BG1, CYP6CY22, CYP6CY13, CYP6P3, CYP6M2, CYP6P4 and CYP4G16. The candidate target-site mutations and detoxification genes warrant further validation. Resistance estimates according to confirmed markers were consistent with population phenotypes, confirming the reliability of this program in predicting population resistance at omics-level.
Affiliation(s)
- Longfei Chen
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Department of Entomology, Nanjing Agricultural University, Nanjing, China
| | - Kun Lang
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Department of Entomology, Nanjing Agricultural University, Nanjing, China
| | - Yang Mei
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Zhenmin Shi
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Kang He
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Fei Li
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Huamei Xiao
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Key Laboratory of Crop Growth and Development Regulation of Jiangxi Province, College of Life Sciences and Resource Environment, Yichun University, Yichun, China
| | - Gongyin Ye
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Zhaojun Han
- Department of Entomology, Nanjing Agricultural University, Nanjing, China
| |
|
41
|
Yu T, Liu J, Gao X, Li G. iPAC: a genome-guided assembler of isoforms via phasing and combing paths. Bioinformatics 2020; 36:2712-2717. [PMID: 31985799 DOI: 10.1093/bioinformatics/btaa052] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/10/2019] [Revised: 12/14/2019] [Accepted: 01/20/2020] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Full-length transcript reconstruction is very important and quite challenging for the widely used RNA-seq data analysis. Currently available RNA-seq assemblers generally suffer from serious limitations in practical applications, such as low assembly accuracy and incompatibility with the latest alignment tools. RESULTS We introduce iPAC, a new genome-guided assembler for reconstruction of isoforms, which revolutionizes the usage of paired-end and sequencing depth information via phasing and combing paths over a newly designed phasing graph. Tested on both simulated and real datasets, it is to some extent superior to all the salient assemblers of the same kind. In particular, iPAC is notably effective at recovering lowly expressed transcripts, whereas the other assemblers are not. AVAILABILITY AND IMPLEMENTATION iPAC is freely available at http://sourceforge.net/projects/transassembly/files. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ting Yu
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Juntao Liu
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Guojun Li
- School of Mathematics, Shandong University, Jinan 250100, China; Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| |
|
42
|
Yu T, Mu Z, Fang Z, Liu X, Gao X, Liu J. TransBorrow: genome-guided transcriptome assembly by borrowing assemblies from different assemblers. Genome Res 2020; 30:1181-1190. [PMID: 32817072 PMCID: PMC7462071 DOI: 10.1101/gr.257766.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/30/2019] [Accepted: 06/18/2020] [Indexed: 12/12/2022]
Abstract
RNA-seq technology is widely used in various transcriptomic studies and provides great opportunities to reveal the complex structures of transcriptomes. To effectively analyze RNA-seq data, we introduce a novel transcriptome assembler, TransBorrow, which borrows the assemblies from different assemblers to search for reliable subsequences by building a colored graph from those borrowed assemblies. Then, by seeding reliable subsequences, a newly designed path extension strategy accurately searches for a transcript-representing path cover over each splicing graph. TransBorrow was tested on both simulated and real data sets and showed great superiority over all the compared leading assemblers.
Affiliation(s)
- Ting Yu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Zengchao Mu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Zhaoyuan Fang
- Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
| | - Xiaoping Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Juntao Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| |
|
43
|
Blay N, Casas E, Galván-Femenía I, Graffelman J, de Cid R, Vavouri T. Assessment of kinship detection using RNA-seq data. Nucleic Acids Res 2020; 47:e136. [PMID: 31501877 PMCID: PMC6868348 DOI: 10.1093/nar/gkz776] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/16/2019] [Revised: 08/23/2019] [Accepted: 08/29/2019] [Indexed: 01/23/2023] Open
Abstract
Analysis of RNA sequencing (RNA-seq) data from related individuals is widely used in clinical and molecular genetics studies. Prediction of kinship from RNA-seq data would be useful for confirming the expected relationships in family based studies and for highlighting samples from related individuals in case-control or population based studies. Currently, reconstruction of pedigrees is largely based on SNPs or microsatellites, obtained from genotyping arrays, whole genome sequencing and whole exome sequencing. Potential problems with using RNA-seq data for kinship detection are the low proportion of the genome that it covers, the highly skewed coverage of exons of different genes depending on expression level and allele-specific expression. In this study we assess the use of RNA-seq data to detect kinship between individuals, through pairwise identity by descent (IBD) estimates. First, we obtained high quality SNPs after successive filters to minimize the effects due to allelic imbalance as well as errors in sequencing, mapping and genotyping. Then, we used these SNPs to calculate pairwise IBD estimates. By analysing both real and simulated RNA-seq data we show that it is possible to identify up to second degree relationships using RNA-seq data of even low to moderate sequencing depth.
Affiliation(s)
- Natalia Blay
- Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain; Josep Carreras Leukaemia Research Institute (IJC), Campus ICO-Germans Trias i Pujol, Universitat Autònoma de Barcelona, Badalona 08916, Spain; Masters Programme in Bioinformatics and Biostatistics, Universitat Oberta de Catalunya (UOC), Barcelona 08035, Spain
| | - Eduard Casas
- Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain; Josep Carreras Leukaemia Research Institute (IJC), Campus ICO-Germans Trias i Pujol, Universitat Autònoma de Barcelona, Badalona 08916, Spain; Doctoral Programme in Biomedicine, Universitat de Barcelona, Barcelona 08007, Spain
| | - Iván Galván-Femenía
- Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain; Genomes for Life - GCAT lab Group - Germans Trias i Pujol Research Institute, Can Ruti Campus, Ctra de Can Ruti, Camí de les Escoles s/n, Badalona, Barcelona 08916, Spain
| | - Jan Graffelman
- Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona 08028, Spain; Department of Biostatistics, University of Washington, Seattle, WA 98105-946, USA
| | - Rafael de Cid
- Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain; Genomes for Life - GCAT lab Group - Germans Trias i Pujol Research Institute, Can Ruti Campus, Ctra de Can Ruti, Camí de les Escoles s/n, Badalona, Barcelona 08916, Spain
| | - Tanya Vavouri
- Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain; Josep Carreras Leukaemia Research Institute (IJC), Campus ICO-Germans Trias i Pujol, Universitat Autònoma de Barcelona, Badalona 08916, Spain
| |
|
44
|
McCarthy SD, González HE, Higgins BD. Future Trends in Nebulized Therapies for Pulmonary Disease. J Pers Med 2020; 10:E37. [PMID: 32397615 PMCID: PMC7354528 DOI: 10.3390/jpm10020037] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 04/02/2020] [Revised: 05/05/2020] [Accepted: 05/07/2020] [Indexed: 12/15/2022] Open
Abstract
Aerosol therapy is a key modality for drug delivery to the lungs of respiratory disease patients. Aerosol therapy improves therapeutic effects by directly targeting diseased lung regions for rapid onset of action, requiring smaller doses than oral or intravenous delivery and minimizing systemic side effects. In order to optimize treatment of critically ill patients, the efficacy of aerosol therapy depends on lung morphology, breathing patterns, aerosol droplet characteristics, disease, mechanical ventilation, pharmacokinetics, and the pharmacodynamics of cell-drug interactions. While aerosol characteristics are influenced by drug formulations and device mechanisms, most other factors are reliant on individual patient variables. This has led to increased efforts towards more personalized therapeutic approaches to optimize pulmonary drug delivery and improve selection of effective drug types for individual patients. Vibrating mesh nebulizers (VMN) are the dominant device in clinical trials involving mechanical ventilation and emerging drugs. In this review, we consider the use of VMN during mechanical ventilation in intensive care units. We aim to link VMN fundamentals to applications in mechanically ventilated patients and look to the future use of VMN in emerging personalized therapeutic drugs.
Affiliation(s)
- Sean D. McCarthy
  - Anaesthesia, School of Medicine, National University of Ireland Galway, H91 TK33 Galway, Ireland
  - Lung Biology Group, Regenerative Medicine Institute (REMEDI) at CÚRAM Centre for Research in Medical Devices, National University of Ireland Galway, H91 TK33 Galway, Ireland
- Héctor E. González
  - Anaesthesia, School of Medicine, National University of Ireland Galway, H91 TK33 Galway, Ireland
  - Lung Biology Group, Regenerative Medicine Institute (REMEDI) at CÚRAM Centre for Research in Medical Devices, National University of Ireland Galway, H91 TK33 Galway, Ireland
- Brendan D. Higgins
  - Physiology, School of Medicine, National University of Ireland Galway, H91 TK33 Galway, Ireland
|
45
|
Zhao J, Feng H, Zhu D, Zhang C, Xu Y. IsoTree: A New Framework for de novo Transcriptome Assembly from RNA-seq Reads. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:938-948. [PMID: 29994455 DOI: 10.1109/tcbb.2018.2808350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
High-throughput sequencing of mRNA has made the deep and efficient probing of transcriptome more affordable. However, the vast amounts of short RNA-seq reads make de novo transcriptome assembly an algorithmic challenge. In this work, we present IsoTree, a novel framework for transcripts reconstruction in the absence of reference genomes. Unlike most of de novo assembly methods that build de Bruijn graph or splicing graph by connecting k-mers, which are sets of overlapping substrings generated from reads, IsoTree constructs splicing graph by connecting reads directly. For each splicing graph, IsoTree applies an iterative scheme of mixed integer linear program to build a prefix tree, called isoform tree. Each path from the root node of the isoform tree to a leaf node represents a plausible transcript candidate which will be pruned based on the information of paired-end reads. Experiments showed that in most cases IsoTree performs better than other leading transcriptome assembly programs. IsoTree is available at https://github.com/Jane110111107/IsoTree.
|
46
|
Li S, Wang Y, Zhao Y, Zhao X, Chen X, Gong Z. Global Co-transcriptional Splicing in Arabidopsis and the Correlation with Splicing Regulation in Mature RNAs. Molecular Plant 2020; 13:266-277. [PMID: 31759129 PMCID: PMC8034514 DOI: 10.1016/j.molp.2019.11.003] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/17/2019] [Revised: 11/01/2019] [Accepted: 11/07/2019] [Indexed: 05/20/2023]
Abstract
RNA splicing and spliceosome assembly in eukaryotes occur mainly during transcription. However, co-transcriptional splicing has not yet been explored in plants. Here, we built transcriptomes of nascent chromatin RNAs in Arabidopsis thaliana and showed that nearly all introns undergo co-transcriptional splicing, which occurs with higher efficiency for introns in protein-coding genes than for those in noncoding RNAs. Total intron number and intron position are two predominant features that correlate with co-transcriptional splicing efficiency, and introns with alternative 5' or 3' splice sites are less efficiently spliced. Furthermore, we found that mutations in genes encoding trans-acting proteins lead to more introns with increased splicing defects in nascent RNAs than in mature RNAs, and that introns with increased splicing defects in mature RNAs are inefficiently spliced at the co-transcriptional level. Collectively, our results not only uncovered widespread co-transcriptional splicing in Arabidopsis but also identified features that may affect or be affected by co-transcriptional splicing efficiency.
Affiliation(s)
- Shaofang Li
  - State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing 100193, China; Joint Laboratory for International Cooperation in Crop Molecular Breeding, China Agricultural University, Beijing 100193, China
- Yuan Wang
  - Department of Botany and Plant Sciences, Institute of Integrative Genome Biology, University of California, Riverside, CA 92521, USA; Guangdong Provincial Key Laboratory for Plant Epigenetics, Longhua Institute of Innovative Biotechnology, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen 518060, China; Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen 518060, China
- Yonghui Zhao
  - Department of Botany and Plant Sciences, Institute of Integrative Genome Biology, University of California, Riverside, CA 92521, USA; Plant Phenomics Research Center, Nanjing Agricultural University, Nanjing 210018, China
- Xinjie Zhao
  - State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing 100193, China; Joint Laboratory for International Cooperation in Crop Molecular Breeding, China Agricultural University, Beijing 100193, China
- Xuemei Chen
  - Department of Botany and Plant Sciences, Institute of Integrative Genome Biology, University of California, Riverside, CA 92521, USA
- Zhizhong Gong
  - State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing 100193, China; Joint Laboratory for International Cooperation in Crop Molecular Breeding, China Agricultural University, Beijing 100193, China
|
47
|
Yu X, Liu X. Mapping RNA-seq reads to transcriptomes efficiently based on learning to hash method. Comput Biol Med 2020; 116:103539. [DOI: 10.1016/j.compbiomed.2019.103539] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/17/2019] [Revised: 11/10/2019] [Accepted: 11/10/2019] [Indexed: 11/25/2022]
|
48
|
Zhao J, Feng H, Zhu D, Zhang C, Xu Y. DTA-SiST: de novo transcriptome assembly by using simplified suffix trees. BMC Bioinformatics 2019; 20:698. [PMID: 31874618 PMCID: PMC6929406 DOI: 10.1186/s12859-019-3272-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
Background: Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability for transcripts reconstruction. However, the massive size of short reads makes de novo transcripts assembly an algorithmic challenge. Results: We develop a novel radical framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs’ terminuses. These reads can be found in linear time of the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented the above two strategies and compared them with the state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy performs always better with recall and also better with precision for smaller datasets while the hybrid strategy leads with precision for big datasets. Conclusions: DTA-SiST performs more competitive than the other compared de novo assemblers especially with precision measure, due to the read-based contig extension strategy and the elegant transcripts extraction rules.
Affiliation(s)
- Jin Zhao
  - School of Computer Science and Technology, Shandong University, Binhai Road, Qingdao, Shandong, People's Republic of China
- Haodi Feng
  - School of Computer Science and Technology, Shandong University, Binhai Road, Qingdao, Shandong, People's Republic of China
- Daming Zhu
  - School of Computer Science and Technology, Shandong University, Binhai Road, Qingdao, Shandong, People's Republic of China
- Chi Zhang
  - Department of Medical and Molecular Genetics and Center for Computational Biology and Bioinformatics, Indiana University, Indianapolis, IN, USA
- Ying Xu
  - Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
|
49
|
Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 2019; 20:278. [PMID: 31842956 PMCID: PMC6912988 DOI: 10.1186/s13059-019-1910-1] [Citation(s) in RCA: 791] [Impact Index Per Article: 158.2] [Received: 09/10/2019] [Accepted: 12/02/2019] [Indexed: 11/13/2022] Open
Abstract
RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.
Affiliation(s)
- Sam Kovaka
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Aleksey V. Zimin
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
- Geo M. Pertea
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
- Roham Razaghi
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
- Steven L. Salzberg
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205 USA
- Mihaela Pertea
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
|
50
|
Di Bella S, La Ferlita A, Carapezza G, Alaimo S, Isacchi A, Ferro A, Pulvirenti A, Bosotti R. A benchmarking of pipelines for detecting ncRNAs from RNA-Seq data. Brief Bioinform 2019; 21:1987-1998. [PMID: 31740918 DOI: 10.1093/bib/bbz110] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/28/2019] [Revised: 07/12/2019] [Accepted: 08/01/2019] [Indexed: 12/18/2022] Open
Abstract
Next-Generation Sequencing (NGS) is a high-throughput technology widely applied to genome sequencing and transcriptome profiling. RNA-Seq uses NGS to reveal RNA identities and quantities in a given sample. However, it produces a huge amount of raw data that need to be preprocessed with fast and effective computational methods. RNA-Seq can look at different populations of RNAs, including ncRNAs. Indeed, in the last few years, several ncRNA pipelines have been developed for ncRNA analysis from RNA-Seq experiments. In this paper, we analyze eight recent pipelines (iSmaRT, iSRAP, miARma-Seq, Oasis 2, SPORTS1.0, sRNAnalyzer, sRNApipe, sRNA workbench) which allow the analysis not only of single specific classes of ncRNAs but also of more than one ncRNA class. Our systematic performance evaluation aims at guiding users to select the appropriate pipeline for processing each ncRNA class, focusing on three key points: (i) accuracy in ncRNA identification, (ii) accuracy in read count estimation and (iii) deployment and ease of use.
Affiliation(s)
- Alessandro La Ferlita
  - Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy; Department of Physics and Astronomy, University of Catania, Catania, Italy
- Salvatore Alaimo
  - Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy
- Alfredo Ferro
  - Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy
- Alfredo Pulvirenti
  - Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy
|