1. Sullivan D, Hjörleifsson K, Swarna N, Oakes C, Holley G, Melsted P, Pachter L. Accurate quantification of nascent and mature RNAs from single-cell and single-nucleus RNA-seq. Nucleic Acids Res 2025; 53:gkae1137. [PMID: 39657125 PMCID: PMC11724275 DOI: 10.1093/nar/gkae1137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 10/28/2024] [Accepted: 12/05/2024] [Indexed: 12/14/2024]
Abstract
In single-cell and single-nucleus RNA sequencing (RNA-seq), the coexistence of nascent (unprocessed) and mature (processed) messenger RNA (mRNA) poses challenges in accurate read mapping and the interpretation of count matrices. The traditional transcriptome reference, defining the "region of interest" in bulk RNA-seq, restricts its focus to mature mRNA transcripts. This restriction leads to two problems: reads originating outside of the "region of interest" are prone to mismapping within this region, and additionally, such external reads cannot be matched to specific transcript targets. Expanding the "region of interest" to encompass both nascent and mature mRNA transcript targets provides a more comprehensive framework for RNA-seq analysis. Here, we introduce the concept of distinguishing flanking k-mers (DFKs) to improve mapping of sequencing reads. We have developed an algorithm to identify DFKs, which serve as a sophisticated "background filter", enhancing the accuracy of mRNA quantification. This dual strategy of an expanded region of interest coupled with the use of DFKs enhances the precision in quantifying both mature and nascent mRNA molecules, as well as in delineating reads of ambiguous status.
Affiliation(s)
- Delaney K Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, 885 Tiverton Drive, Los Angeles, CA 90095, USA
- Kristján Eldjárn Hjörleifsson
  - Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Nikhila P Swarna
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Conrad Oakes
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Guillaume Holley
  - deCODE Genetics/Amgen Inc., Sturlugata 8, 101 Reykjavík, Iceland
- Páll Melsted
  - deCODE Genetics/Amgen Inc., Sturlugata 8, 101 Reykjavík, Iceland
  - School of Engineering and Natural Sciences, University of Iceland, Sæmundargata 2, 102 Reykjavík, Iceland
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
2. Kuehl M, Wong MN, Wanner N, Bonn S, Puelles VG. Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python. Bioinformatics 2024; 40:btae700. [PMID: 39565903 PMCID: PMC11629965 DOI: 10.1093/bioinformatics/btae700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Revised: 10/08/2024] [Accepted: 11/18/2024] [Indexed: 11/22/2024]
Abstract
SUMMARY: Transcript quantification tools efficiently map bulk RNA sequencing (RNA-seq) reads to reference transcriptomes. However, their output consists of transcript count estimates that are subject to multiple biases and cannot be readily used with existing differential gene expression analysis tools in Python. Here we present pytximport, a Python implementation of the tximport R package that supports a variety of input formats, different modes of bias correction, inferential replicates, gene-level summarization of transcript counts, transcript-level exports, transcript-to-gene mapping generation, and optional filtering of transcripts by biotype. pytximport is part of the scverse ecosystem of open-source Python software packages for omics analyses and includes both a Python and a command-line interface. With pytximport, we propose a bulk RNA-seq analysis workflow based on Bioconda and scverse ecosystem packages, ensuring reproducible analyses through Snakemake rules. We apply this pipeline to a publicly available RNA-seq dataset, demonstrating how pytximport enables the creation of Python-centric workflows capable of providing insights into transcriptomic alterations. AVAILABILITY AND IMPLEMENTATION: pytximport is licensed under the GNU General Public License version 3. The source code is available at https://github.com/complextissue/pytximport and via Zenodo with DOI: 10.5281/zenodo.13907917. A related Snakemake workflow is available through GitHub at https://github.com/complextissue/snakemake-bulk-rna-seq-workflow and Zenodo with DOI: 10.5281/zenodo.12713811. Documentation and a vignette for new users are available at: https://pytximport.readthedocs.io.
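The "gene-level summarization of transcript counts" mentioned in the abstract above can be illustrated with a minimal sketch. This is not pytximport's API: the function name `summarize_to_gene`, the identifiers, and the toy values are all hypothetical, and the sketch only shows the simplest form of the idea (summing transcript count estimates per gene), without the bias-correction modes the abstract refers to.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, transcript_to_gene):
    """Sum per-transcript count estimates into per-gene totals.

    Hypothetical helper for illustration only; it is not part of pytximport.
    Transcripts that have no entry in the mapping are skipped.
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            continue  # no gene assignment for this transcript
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Toy input (made-up values): estimated counts per transcript
counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 2.0}
# Toy transcript-to-gene mapping; "tx4" is deliberately unmapped
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_level = summarize_to_gene(counts, tx2gene)
print(gene_level)
```

In this toy case the two transcripts of `geneA` are pooled into one value, and the unmapped transcript is dropped rather than assigned to a placeholder gene; a real workflow may handle unmapped transcripts differently.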
Affiliation(s)
- Malte Kuehl
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus N, Midtjylland, 8200, Denmark
  - Department of Pathology, Aarhus University Hospital, Palle Juul-Jensens Boulevard 69, Aarhus N, Midtjylland, 8200, Denmark
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Falkenried 94, Hamburg, Hamburg, 20251, Germany
  - Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
- Milagros N Wong
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus N, Midtjylland, 8200, Denmark
  - Department of Pathology, Aarhus University Hospital, Palle Juul-Jensens Boulevard 69, Aarhus N, Midtjylland, 8200, Denmark
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
  - Hamburg Center for Kidney Health, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
- Nicola Wanner
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
  - Hamburg Center for Kidney Health, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
- Stefan Bonn
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Falkenried 94, Hamburg, Hamburg, 20251, Germany
  - Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
- Victor G Puelles
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus N, Midtjylland, 8200, Denmark
  - Department of Pathology, Aarhus University Hospital, Palle Juul-Jensens Boulevard 69, Aarhus N, Midtjylland, 8200, Denmark
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
  - Hamburg Center for Kidney Health, University Medical Center Hamburg-Eppendorf, Martinistraße 52, Hamburg, Hamburg, 20246, Germany
3. Goñi E, Mas AM, Gonzalez J, Abad A, Santisteban M, Fortes P, Huarte M, Hernaez M. Uncovering functional lncRNAs by scRNA-seq with ELATUS. Nat Commun 2024; 15:9709. [PMID: 39521797 PMCID: PMC11550465 DOI: 10.1038/s41467-024-54005-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024]
Abstract
Long non-coding RNAs (lncRNAs) play fundamental roles in cellular processes and pathologies, regulating gene expression at multiple levels. Despite being highly cell type-specific, their study at single-cell (sc) level is challenging due to their less accurate annotation and low expression compared to protein-coding genes. Here, we systematically benchmark different preprocessing methods and develop a computational framework, named ELATUS, based on the combination of the pseudoaligner Kallisto with selective functional filtering. ELATUS enhances the detection of functional lncRNAs from scRNA-seq data, detecting their expression with higher concordance than standard methods with the ATAC-seq profiles in single-cell multiome data. Interestingly, the better results of ELATUS are due to its advanced performance with an inaccurate reference annotation such as that of lncRNAs. We independently confirm the expression patterns of cell type-specific lncRNAs exclusively detected with ELATUS and unveil biologically important lncRNAs, such as AL121895.1, a previously undocumented cis-repressor lncRNA, whose role in breast cancer progression is unnoticed by traditional methodologies. Our results emphasize the necessity for an alternative scRNA-seq workflow tailored to lncRNAs that sheds light on the multifaceted roles of lncRNAs.
Affiliation(s)
- Enrique Goñi
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
- Aina Maria Mas
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
- Jovanna Gonzalez
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
- Amaya Abad
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
- Marta Santisteban
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
  - Department of Medical Oncology, Breast Cancer Unit, Clinica Universidad de Navarra, Pio XII 36 Ave, Pamplona, Spain
- Puri Fortes
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
  - Liver and Digestive Diseases Networking Biomedical Research Centre (CIBERehd), Spanish Network for Advanced Therapies (TERAV ISCIII), Madrid, Spain
- Maite Huarte
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
- Mikel Hernaez
  - Center for Applied Medical Research, University of Navarra, PIO XII 55 Ave, Pamplona, Spain
  - Institute of Health Research of Navarra (IdiSNA), Pamplona, Spain
  - Cancer Center Clinica Universidad de Navarra (CCUN), Madrid, Spain
  - Data Science and Artificial Intelligence Institute (DATAI), Universidad de Navarra, Pamplona, Spain
4. Chamberlin JT, Lee Y, Marth GT, Quinlan AR. Differences in molecular sampling and data processing explain variation among single-cell and single-nucleus RNA-seq experiments. Genome Res 2024; 34:179-188. [PMID: 38355308 PMCID: PMC10984380 DOI: 10.1101/gr.278253.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 02/01/2024] [Indexed: 02/16/2024]
Abstract
A mechanistic understanding of the biological and technical factors that impact transcript measurements is essential to designing and analyzing single-cell and single-nucleus RNA sequencing experiments. Nuclei contain the same pre-mRNA population as cells, but they contain a small subset of the mRNAs. Nonetheless, early studies argued that single-nucleus analysis yielded results comparable to cellular samples if pre-mRNA measurements were included. However, typical workflows do not distinguish between pre-mRNA and mRNA when estimating gene expression, and variation in their relative abundances across cell types has received limited attention. These gaps are especially important given that incorporating pre-mRNA has become commonplace for both assays, despite known gene length bias in pre-mRNA capture. Here, we reanalyze public data sets from mouse and human to describe the mechanisms and contrasting effects of mRNA and pre-mRNA sampling on gene expression and marker gene selection in single-cell and single-nucleus RNA-seq. We show that pre-mRNA levels vary considerably among cell types, which mediates the degree of gene length bias and limits the generalizability of a recently published normalization method intended to correct for this bias. As an alternative, we repurpose an existing post hoc gene length-based correction method from conventional RNA-seq gene set enrichment analysis. Finally, we show that inclusion of pre-mRNA in bioinformatic processing can impart a larger effect than assay choice itself, which is pivotal to the effective reuse of existing data. These analyses advance our understanding of the sources of variation in single-cell and single-nucleus RNA-seq experiments and provide useful guidance for future studies.
Affiliation(s)
- John T Chamberlin
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA
- Younghee Lee
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA
  - Seoul National University, College of Veterinary Medicine, Seoul, 08826, South Korea
- Gabor T Marth
  - Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, Utah 84112, USA
- Aaron R Quinlan
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA
  - Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, Utah 84112, USA
5. He D, Gao Y, Chan SS, Quintana-Parrilla N, Patro R. Forseti: A mechanistic and predictive model of the splicing status of scRNA-seq reads. bioRxiv 2024:2024.02.01.577813. [PMID: 38370848 PMCID: PMC10871212 DOI: 10.1101/2024.02.01.577813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Motivation: Short-read single-cell RNA-sequencing (scRNA-seq) has been used to study cellular heterogeneity, cellular fate, and transcriptional dynamics. Modeling splicing dynamics in scRNA-seq data is challenging, with inherent difficulty in even the seemingly straightforward task of elucidating the splicing status of the molecules from which sequenced fragments are drawn. This difficulty arises, in part, from the limited read length and positional biases, which substantially reduce the specificity of the sequenced fragments. As a result, the splicing status of many reads in scRNA-seq is ambiguous because of a lack of definitive evidence. We are therefore in need of methods that can recover the splicing status of ambiguous reads which, in turn, can lead to more accuracy and confidence in downstream analyses. Results: We develop Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. Our model has two key components. First, we train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation. Second, we fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types. Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites. Using both simulated and experimental data, we show that our model can precisely predict the splicing status of reads and identify the true gene origin of multi-gene mapped reads. Availability: Forseti and the code used for producing the results are available at https://github.com/COMBINE-lab/forseti under a BSD 3-clause license.
Affiliation(s)
- Dongze He
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
  - Program in Computational Biology, Bioinformatics and Genomics, University of Maryland, College Park, MD 20742, USA
- Yuan Gao
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
  - Program in Computational Biology, Bioinformatics and Genomics, University of Maryland, College Park, MD 20742, USA
- Spencer Skylar Chan
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Rob Patro
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
6. He D, Mount SM, Patro R. scCensus: Off-target scRNA-seq reads reveal meaningful biology. bioRxiv 2024:2024.01.29.577807. [PMID: 38352549 PMCID: PMC10862729 DOI: 10.1101/2024.01.29.577807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2024]
Abstract
Single-cell RNA-sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity. Although scRNA-seq reads from most prevalent and popular tagged-end protocols are expected to arise from the 3' end of polyadenylated RNAs, recent studies have shown that "off-target" reads can constitute a substantial portion of the read population. In this work, we introduced scCensus, a comprehensive analysis workflow for systematically evaluating and categorizing off-target reads in scRNA-seq. We applied scCensus to seven scRNA-seq datasets. Our analysis of intergenic reads shows that these off-target reads contain information about chromatin structure and can be used to identify similar cells across modalities. Our analysis of antisense reads suggests that these reads can be used to improve gene detection and capture interesting transcriptional activities like antisense transcription. Furthermore, using splice-aware quantification, we find that spliced and unspliced reads provide distinct information about cell clusters and biomarkers, suggesting the utility of integrating signals from reads with different splicing statuses. Overall, our results suggest that off-target scRNA-seq reads contain underappreciated information about various transcriptional activities. These observations about yet-unexploited information in existing scRNA-seq data will help guide and motivate the community to improve current algorithms and analysis methods, and to develop novel approaches that utilize off-target reads to extend the reach and accuracy of single-cell data analysis pipelines.
Affiliation(s)
- Dongze He
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
  - Program in Computational Biology, Bioinformatics and Genomics, University of Maryland, College Park, MD 20742, USA
- Stephen M. Mount
  - Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
- Rob Patro
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
7. Gorin G, Vastola JJ, Pachter L. Studying stochastic systems biology of the cell with single-cell genomics data. bioRxiv 2023:2023.05.17.541250. [PMID: 37292934 PMCID: PMC10245677 DOI: 10.1101/2023.05.17.541250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.
Affiliation(s)
- Gennady Gorin
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, 91125
- John J. Vastola
  - Department of Neurobiology, Harvard Medical School, Boston, MA, 02115
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125
8. He D, Patro R. simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry. bioRxiv 2023:2023.03.28.534653. [PMID: 37034702 PMCID: PMC10081176 DOI: 10.1101/2023.03.28.534653] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/19/2023]
Abstract
Summary: The alevin-fry ecosystem provides a robust and growing suite of programs for single-cell data processing. However, as new single-cell technologies are introduced, as the community continues to adjust best practices for data processing, and as the alevin-fry ecosystem itself expands and grows, it is becoming increasingly important to manage the complexity of alevin-fry's single-cell preprocessing workflows while retaining the performance and flexibility that make these tools enticing. We introduce simpleaf, a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools. Availability and implementation: Simpleaf is written in Rust and released under a BSD 3-Clause license. It is freely available from its GitHub repository https://github.com/COMBINE-lab/simpleaf, and via bioconda. Documentation for simpleaf is available at https://simpleaf.readthedocs.io/en/latest/ and tutorials for simpleaf are being developed that can be accessed at https://combine-lab.github.io/alevin-fry-tutorials.
Affiliation(s)
- Dongze He
  - Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
- Rob Patro
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA