1
Murali M, Saquing J, Lu S, Gao Z, Jordan B, Wakefield ZP, Fiszbein A, Cooper DR, Castaldi PJ, Korkin D, Sheynkman G. Biosurfer for systematic tracking of regulatory mechanisms leading to protein isoform diversity. bioRxiv 2024:2024.03.15.585320. [PMID: 38559226 PMCID: PMC10980011 DOI: 10.1101/2024.03.15.585320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Long-read RNA sequencing has shed light on transcriptomic complexity, but questions remain about the functionality of downstream protein products. We introduce Biosurfer, a computational approach for comparing protein isoforms while systematically tracking the transcriptional, splicing, and translational variations that underlie differences in the sequences of the protein products. Using Biosurfer, we analyzed the differences in 32,799 pairs of GENCODE-annotated protein isoforms, finding that a majority (70%) of variable N-termini are due to alternative transcription start sites, while only 9% arise from 5' UTR alternative splicing. Biosurfer's detailed tracking of nucleotide-to-residue relationships helped reveal an uncommonly tracked source of single amino acid residue changes arising from codon splits at junctions. For 17% of internal sequence changes, such split codon patterns lead to single residue differences, termed "ragged codons". Of variable C-termini, 72% involve splice- or intron retention-induced reading frameshifts. We found an unusual pattern of reading frame changes, in which the first frameshift is closely followed by a distinct second frameshift that restores the original frame, which we term a "snapback" frameshift. We analyzed the long-read RNA-seq-predicted proteome of a human cell line and found trends similar to those in our GENCODE analysis, with the exception of a higher proportion of isoforms predicted to undergo nonsense-mediated decay. Biosurfer's comprehensive characterization of long-read RNA-seq datasets should accelerate insights into the functional role of protein isoforms, providing a mechanistic explanation of the origins of the proteomic diversity driven by alternative splicing. Biosurfer is available as a Python package at https://github.com/sheynkman-lab/biosurfer.
Affiliation(s)
- Mayank Murali
  - Broad Institute of MIT and Harvard University, Cambridge, MA, USA
- Jamie Saquing
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Senbao Lu
  - Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
  - Computer Science Department, Worcester Polytechnic Institute, Worcester, MA, USA
- Ziyang Gao
  - Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
  - Computer Science Department, Worcester Polytechnic Institute, Worcester, MA, USA
- Ben Jordan
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Zachary Peters Wakefield
  - Bioinformatics Program, Boston University, Boston, MA, USA
  - Department of Biology, Boston University, Boston, MA, USA
- Ana Fiszbein
  - Bioinformatics Program, Boston University, Boston, MA, USA
  - Department of Biology, Boston University, Boston, MA, USA
- David R. Cooper
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Peter J. Castaldi
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Division of General Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Dmitry Korkin
  - Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
  - Computer Science Department, Worcester Polytechnic Institute, Worcester, MA, USA
- Gloria Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - UVA Cancer Center, University of Virginia, Charlottesville, VA, USA
2
Calvo-Roitberg E, Carroll CL, Venev SV, Kim G, Mick ST, Dekker J, Fiszbein A, Pai AA. mRNA initiation and termination are spatially coordinated. bioRxiv 2024:2024.01.05.574404. [PMID: 38260419 PMCID: PMC10802295 DOI: 10.1101/2024.01.05.574404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
The expression of a precise mRNA transcriptome is crucial for establishing cell identity and function, with dozens of alternative isoforms produced for a single gene sequence. The regulation of mRNA isoform usage occurs by the coordination of co-transcriptional mRNA processing mechanisms across a gene. Decisions involved in mRNA initiation and termination underlie the largest extent of mRNA isoform diversity, but little is known about any relationships between decisions at both ends of mRNA molecules. Here, we systematically profile the joint usage of mRNA transcription start sites (TSSs) and polyadenylation sites (PASs) across tissues and species. Using both short- and long-read RNA-seq data, we observe that mRNAs preferentially using upstream TSSs also tend to use upstream PASs, and congruently, the usage of downstream sites is similarly paired. This observation suggests that mRNA 5' end choice may directly influence mRNA 3' ends. Our results suggest a novel "Positional Initiation-Termination Axis" (PITA), in which the usage of alternative terminal sites is coupled based on the order in which they appear in the genome. PITA isoforms are more likely to encode alternative protein domains and use conserved sites. PITA is strongly associated with the length of genomic features, such that PITA is enriched in longer genes with more area devoted to regions that regulate alternative 5' or 3' ends. Strikingly, we found that PITA genes are more likely than non-PITA genes to have multiple, overlapping chromatin structural domains related to pairing of ordinally coupled start and end sites. In turn, PITA coupling is also associated with fast RNA Polymerase II (RNAPII) trafficking across these long gene regions. Our findings indicate that a combination of spatial and kinetic mechanisms couples transcription initiation and mRNA 3' end decisions based on ordinal position to define the expression of mRNA isoforms.
Affiliation(s)
- Sergey V. Venev
  - Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA
- GyeungYun Kim
  - Department of Biology, Boston University, Boston, MA
- Job Dekker
  - Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA
  - Howard Hughes Medical Institute, Chevy Chase, MD
- Ana Fiszbein
  - Department of Biology, Boston University, Boston, MA
  - Center for Computing & Data Sciences, Boston University, Boston, MA
- Athma A. Pai
  - RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, MA
3
Gu X, Wang M, Zhang XO. TE-TSS: an integrated data resource of human and mouse transposable element (TE)-derived transcription start site (TSS). Nucleic Acids Res 2024; 52:D322-D333. [PMID: 37956335 PMCID: PMC10767810 DOI: 10.1093/nar/gkad1048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Revised: 10/21/2023] [Accepted: 10/23/2023] [Indexed: 11/15/2023] [Open access]
Abstract
Transposable elements (TEs) are abundant in the genome and serve as crucial regulatory elements. Some TEs function as epigenetically regulated promoters, and these TE-derived transcription start sites (TSSs) play a crucial role in regulating genes associated with specific functions, such as cancer and embryogenesis. However, the lack of an accessible database that systematically gathers TE-derived TSS data is a current research gap. To address this, we established TE-TSS, an integrated data resource of human and mouse TE-derived TSSs (http://xozhanglab.com/TETSS). TE-TSS has compiled 2681 RNA sequencing datasets, spanning various tissues, cell lines and developmental stages. From these, we identified 5768 human TE-derived TSSs and 2797 mouse TE-derived TSSs, with 47% and 38% being experimentally validated, respectively. TE-TSS enables comprehensive exploration of TSS usage in diverse samples, providing insights into tissue-specific gene expression patterns and transcriptional regulatory elements. Furthermore, TE-TSS compares TE-derived TSS regions across 15 mammalian species, enhancing our understanding of their evolutionary and functional aspects. The establishment of TE-TSS facilitates further investigations into the roles of TEs in shaping the transcriptomic landscape and offers valuable resources for comprehending their involvement in diverse biological processes.
Affiliation(s)
- Xiaobing Gu
  - Shanghai Key Laboratory of Maternal and Fetal Medicine, Clinical and Translational Research Center of Shanghai First Maternity and Infant Hospital, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Mingdong Wang
  - Shanghai Key Laboratory of Maternal and Fetal Medicine, Clinical and Translational Research Center of Shanghai First Maternity and Infant Hospital, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Xiao-Ou Zhang
  - Shanghai Key Laboratory of Maternal and Fetal Medicine, Clinical and Translational Research Center of Shanghai First Maternity and Infant Hospital, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
4
Calvo-Roitberg E, Daniels RF, Pai AA. Challenges in identifying mRNA transcript starts and ends from long-read sequencing data. bioRxiv 2023:2023.07.26.550536. [PMID: 37546743 PMCID: PMC10402045 DOI: 10.1101/2023.07.26.550536] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 08/08/2023]
Abstract
Long-read sequencing (LRS) technologies have the potential to revolutionize scientific discoveries in RNA biology, especially by enabling the comprehensive identification and quantification of full-length mRNA isoforms. However, inherently high error rates make the analysis of long-read sequencing data challenging. While these error rates have been characterized for sequence and splice site identification, it is still unclear how accurately LRS reads represent transcript start and end sites. Here, we systematically assess the variability and accuracy of mRNA terminal ends identified by LRS reads across multiple sequencing platforms. We find substantial inconsistencies in both the start and end coordinates of LRS reads spanning a gene, such that LRS reads often fail to accurately recapitulate annotated or empirically derived terminal ends of mRNA molecules. To address this challenge, we introduce an approach to condition reads based on empirically derived terminal ends and identify a subset of reads that are more likely to represent full-length transcripts. Our approach can improve transcriptome analyses by enhancing the fidelity of transcript terminal end identification, but may result in lower power to quantify genes or discover novel isoforms. Thus, it is necessary to be cautious when selecting sequencing approaches and/or interpreting data from long-read RNA sequencing.
Affiliation(s)
- Rachel F Daniels
  - RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, MA
- Athma A Pai
  - RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, MA
5
Uriostegui-Arcos M, Mick ST, Shi Z, Rahman R, Fiszbein A. Splicing activates transcription from weak promoters upstream of alternative exons. Nat Commun 2023; 14:3435. [PMID: 37301863 PMCID: PMC10256964 DOI: 10.1038/s41467-023-39200-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/19/2022] [Accepted: 06/02/2023] [Indexed: 06/12/2023] [Open access]
Abstract
Transcription and splicing are intrinsically coupled. Alternative splicing of internal exons can fine-tune gene expression through a recently described phenomenon called exon-mediated activation of transcription starts (EMATS). However, the association of this phenomenon with human diseases remains unknown. Here, we develop a strategy to activate gene expression through EMATS and demonstrate its potential for treatment of genetic diseases caused by loss of expression of essential genes. We first identified a catalog of human EMATS genes and provide a list of their pathological variants. To test if EMATS can be used to activate gene expression, we constructed stable cell lines expressing a splicing reporter based on the alternative splicing of the survival of motor neuron 2 (SMN2) gene. Using small molecules and antisense oligonucleotides (ASOs) currently used for treatment of spinal muscular atrophy, we demonstrated that increased inclusion of alternative exons can trigger an activation of gene expression of up to 45-fold by enhancing transcription in EMATS-like genes. We observed the strongest effects in genes under the regulation of weak human promoters located proximal to highly included skipped exons.
Affiliation(s)
- Steven T Mick
  - Biology Department, Boston University, Boston, 02215, USA
- Zhuo Shi
  - Biology Department, Massachusetts Institute of Technology, Cambridge, 02139, USA
- Rufuto Rahman
  - Biology Department, Boston University, Boston, 02215, USA
- Ana Fiszbein
  - Biology Department, Boston University, Boston, 02215, USA
6
Vlasenok M, Margasyuk S, Pervouchine DD. Transcriptome sequencing suggests that pre-mRNA splicing counteracts widespread intronic cleavage and polyadenylation. NAR Genom Bioinform 2023; 5:lqad051. [PMID: 37260513 PMCID: PMC10227441 DOI: 10.1093/nargab/lqad051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Revised: 05/09/2023] [Accepted: 05/17/2023] [Indexed: 06/02/2023] [Open access]
Abstract
Alternative splicing (AS) and alternative polyadenylation (APA) are two crucial steps in the post-transcriptional regulation of eukaryotic gene expression. Protocols capturing and sequencing RNA 3'-ends have uncovered widespread intronic polyadenylation (IPA) in normal and disease conditions, where it is currently attributed to stochastic variations in the pre-mRNA processing. Here, we took advantage of the massive amount of RNA-seq data generated by the Genotype Tissue Expression project (GTEx) to simultaneously identify and match tissue-specific expression of intronic polyadenylation sites with tissue-specific splicing. A combination of computational methods including the analysis of short reads with non-templated adenines revealed that APA events are more abundant in introns than in exons. While the rate of IPA in composite terminal exons and skipped terminal exons expectedly correlates with splicing, we observed a considerable fraction of IPA events that lack AS support and attributed them to spliced polyadenylated introns (SPI). We hypothesize that SPIs represent transient byproducts of a dynamic coupling between APA and AS, in which the spliceosome removes the intron while it is being cleaved and polyadenylated. These findings indicate that cotranscriptional pre-mRNA splicing could serve as a rescue mechanism to suppress premature transcription termination at intronic polyadenylation sites.
Affiliation(s)
- Maria Vlasenok
  - Center for Molecular and Cellular Biology, Skolkovo Institute of Science and Technology, Bolshoy Bulvar 30, Moscow 121205, Russia
- Sergey Margasyuk
  - Center for Molecular and Cellular Biology, Skolkovo Institute of Science and Technology, Bolshoy Bulvar 30, Moscow 121205, Russia
- Dmitri D Pervouchine
  - Center for Molecular and Cellular Biology, Skolkovo Institute of Science and Technology, Bolshoy Bulvar 30, Moscow 121205, Russia
7
Reprogramming RNA processing: an emerging therapeutic landscape. Trends Pharmacol Sci 2022; 43:437-454. [PMID: 35331569 DOI: 10.1016/j.tips.2022.02.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/06/2022] [Revised: 02/22/2022] [Accepted: 02/24/2022] [Indexed: 12/13/2022]
Abstract
The production of a mature mRNA requires coordination of multiple processing steps, which ultimately control its content, localization, and stability. These steps include some of the largest macromolecular machines in the cell, which were, until recently, considered undruggable due to their biological complexity. Building from an expanded understanding of the underlying mechanisms that drive these processes, a new wave of therapeutics is seeking to target RNA processing. With a focus on impacting gene regulation at the RNA level, such modalities offer potential for sequence-specific resolution in drug design. Here, we review our current understanding of RNA-processing events and their role in gene regulation, with a focus on the therapeutic opportunities that have emerged within this landscape.