1
Chen S, Wang H, Zhang D, Chen R, Luo J. Readon: a novel algorithm to identify read-through transcripts with long-read sequencing data. Bioinformatics 2024; 40:btae336. [PMID: 38808568 PMCID: PMC11162696 DOI: 10.1093/bioinformatics/btae336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 04/30/2024] [Accepted: 05/26/2024] [Indexed: 05/30/2024]
Abstract
MOTIVATION: There are many clustered transcriptionally active regions in the human genome, in which the transcription complex cannot immediately terminate transcription at the upstream gene termination site, but instead continues to transcribe intergenic regions and downstream genes, resulting in read-through transcripts. Several studies have demonstrated the regulatory roles of read-through transcripts in tumorigenesis and development. However, limited by the read length of next-generation sequencing, discovery of read-through transcripts has been slow. For long but also erroneous third-generation sequencing data, this study developed a novel minimizer sketch algorithm to accurately and quickly identify read-through transcripts.
RESULTS: Readon initially splits the reference sequence into distinct active regions. It employs a sliding window approach within each region, calculates minimizers, and constructs the specialized structured arrays for query indexing. Following initial alignment anchor screening of candidate read-through transcripts, further confirmation steps are executed. Comparative assessments against existing software reveal Readon's superior performance on both simulated and validated real data. Additionally, two downstream tools are provided: one for predicting whether a read-through transcript is likely to undergo nonsense-mediated decay or encodes a protein, and another for visualizing splicing patterns.
AVAILABILITY AND IMPLEMENTATION: Readon is freely available on GitHub (https://github.com/Bulabula45/Readon).
Affiliation(s)
- Siang Chen
  - Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Hao Wang
  - Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Dongdong Zhang
  - Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
- Runsheng Chen
  - Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Jianjun Luo
  - Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
2
Wang W, Li Y, Ko S, Feng N, Zhang M, Liu JJ, Zheng S, Ren B, Yu YP, Luo JH, Tseng GC, Liu S. IFDlong: an isoform and fusion detector for accurate annotation and quantification of long-read RNA-seq data. bioRxiv 2024:2024.05.11.593690. [PMID: 38798496 PMCID: PMC11118288 DOI: 10.1101/2024.05.11.593690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Advancements in long-read transcriptome sequencing (long-RNA-seq) technology have revolutionized the study of isoform diversity. These full-length transcripts enhance the detection of various transcriptome structural variations, including novel isoforms, alternative splicing events, and fusion transcripts. By shifting the open reading frame or altering gene expressions, studies have proved that these transcript alterations can serve as crucial biomarkers for disease diagnosis and therapeutic targets. In this project, we proposed IFDlong, a bioinformatics and biostatistics tool to detect isoform and fusion transcripts using bulk or single-cell long-RNA-seq data. Specifically, the software performed gene and isoform annotation for each long-read, defined novel isoforms, quantified isoform expression by a novel expectation-maximization algorithm, and profiled the fusion transcripts. For evaluation, IFDlong pipeline achieved overall the best performance when compared with several existing tools in large-scale simulation studies. In both isoform and fusion transcript quantification, IFDlong is able to reach more than 0.8 Spearman's correlation with the truth, and more than 0.9 cosine similarity when distinguishing multiple alternative splicing events. In novel isoform simulation, IFDlong can successfully balance the sensitivity (higher than 90%) and specificity (higher than 90%). Furthermore, IFDlong has proved its accuracy and robustness in diverse in-house and public datasets on healthy tissues, cell lines and multiple types of diseases. Besides bulk long-RNA-seq, IFDlong pipeline has proved its compatibility to single-cell long-RNA-seq data. This new software may hold promise for significant impact on long-read transcriptome analysis. The IFDlong software is available at https://github.com/wenjiaking/IFDlong.
Affiliation(s)
- Wenjia Wang
  - Department of Biostatistics, School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Yuzhen Li
  - Department of Surgery, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Sungjin Ko
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Ning Feng
  - Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Manling Zhang
  - Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Jia-Jun Liu
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Songyang Zheng
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Baoguo Ren
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Yan P. Yu
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Jian-Hua Luo
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
  - Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA
- George C. Tseng
  - Department of Biostatistics, School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Silvia Liu
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
  - Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA
  - Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
3
Karaoğlanoğlu F, Orabi B, Flannigan R, Chauve C, Hach F. TKSM: highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator. Bioinformatics 2024; 40:btae051. [PMID: 38273664 PMCID: PMC10868325 DOI: 10.1093/bioinformatics/btae051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 01/10/2024] [Accepted: 01/23/2024] [Indexed: 01/27/2024]
Abstract
MOTIVATION: Transcriptomic long-read (LR) sequencing is an increasingly cost-effective technology for probing various RNA features. Numerous tools have been developed to tackle various transcriptomic sequencing tasks (e.g. isoform and gene fusion detection). However, the lack of abundant gold-standard datasets hinders the benchmarking of such tools. Therefore, the simulation of LR sequencing is an important and practical alternative. While the existing LR simulators aim to imitate the sequencing machine noise and to target specific library protocols, they lack some important library preparation steps (e.g. PCR) and are difficult to modify to new and changing library preparation techniques (e.g. single-cell LRs).
RESULTS: We present TKSM, a modular and scalable LR simulator, designed so that each RNA modification step is targeted explicitly by a specific module. This allows the user to assemble a simulation pipeline as a combination of TKSM modules to emulate a specific sequencing design. Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps.
AVAILABILITY AND IMPLEMENTATION: TKSM is available as an open source software at https://github.com/vpc-ccg/tksm.
Affiliation(s)
- Fatih Karaoğlanoğlu
  - Computing Science Department, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Baraa Orabi
  - Department of Computer Science, the University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Ryan Flannigan
  - Department of Urologic Sciences, the University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Cedric Chauve
  - Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Faraz Hach
  - Department of Computer Science, the University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Department of Urologic Sciences, the University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
4
Dorney R, Dhungel BP, Rasko JEJ, Hebbard L, Schmitz U. Recent advances in cancer fusion transcript detection. Brief Bioinform 2022; 24:6918739. [PMID: 36527429 PMCID: PMC9851307 DOI: 10.1093/bib/bbac519] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/26/2022] [Revised: 10/11/2022] [Accepted: 10/31/2022] [Indexed: 12/23/2022]
Abstract
Extensive investigation of gene fusions in cancer has led to the discovery of novel biomarkers and therapeutic targets. To date, most studies have neglected chromosomal rearrangement-independent fusion transcripts and complex fusion structures such as double or triple-hop fusions, and fusion-circRNAs. In this review, we untangle fusion-related terminology and propose a classification system involving both gene and transcript fusions. We highlight the importance of RNA-level fusions and how long-read sequencing approaches can improve detection and characterization. Moreover, we discuss novel bioinformatic tools to identify fusions in long-read sequencing data and strategies to experimentally validate and functionally characterize fusion transcripts.
Affiliation(s)
- Ryley Dorney
  - Department of Molecular & Cell Biology, College of Public Health, Medical & Vet Sciences, James Cook University, Douglas, QLD 4811, Australia
  - Centre for Tropical Bioinformatics and Molecular Biology, Australian Institute of Tropical Health and Medicine, James Cook University, Cairns 4878, Australia
- Bijay P Dhungel
  - Gene and Stem Cell Therapy Program Centenary Institute, The University of Sydney, Camperdown, NSW 2050, Australia
  - Faculty of Medicine & Health, The University of Sydney, Camperdown, NSW 2006, Australia
  - Centre for Tropical Bioinformatics and Molecular Biology, Australian Institute of Tropical Health and Medicine, James Cook University, Cairns 4878, Australia
- John E J Rasko
  - Gene and Stem Cell Therapy Program Centenary Institute, The University of Sydney, Camperdown, NSW 2050, Australia
  - Faculty of Medicine & Health, The University of Sydney, Camperdown, NSW 2006, Australia
- Lionel Hebbard
  - Department of Molecular & Cell Biology, College of Public Health, Medical & Vet Sciences, James Cook University, Douglas, QLD 4811, Australia
  - Storr Liver Centre, Westmead Institute for Medical Research, Westmead Hospital and University of Sydney, Sydney, New South Wales, Australia
- Ulf Schmitz
  - Corresponding author. Ulf Schmitz, Department of Molecular and Cell Biology, College of Public Health, Medical and Vet Sciences, James Cook University, Douglas, QLD 4811, Australia. E-mail: