1
Leviyang S. Analysis of a Single Cell RNA-seq Workflow by Random Matrix Theory Methods. Bull Math Biol 2024; 87:4. [PMID: 39585539 DOI: 10.1007/s11538-024-01376-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2024] [Accepted: 10/18/2024] [Indexed: 11/26/2024]
Abstract
Single cell RNA-seq (scRNAseq) workflows typically start with a count matrix and end with the clustering of sampled cells. While a range of methods have been developed to cluster scRNAseq datasets, no theoretical tools exist to explain why a particular cluster exists or why a hypothesized cluster is missing. Recently, several authors have shown that eigenvalues of scRNAseq count matrices can be approximated using random matrix models. In this work, we extend these previous works to the study of a scRNAseq workflow. We model scaled count matrices using random matrices with normally distributed entries. Using these random matrix models, we quantify the differential expression of a cluster and develop predictions for the workflow, and in particular clustering, as a function of the differential expression. We also use results from random matrix theory (RMT) to develop predictive formulas for portions of the scRNAseq workflow. Using simulated and real datasets, we show that our predictions are accurate if certain conditions hold on differential expression, with our RMT based predictions requiring particularly stringent conditions. We find that real datasets violate these conditions, leading to bias in our predictions, but our predictions are better than a naive estimator and we point out future work that can improve the predictions. To our knowledge, our formulas represent the first predictive results for scRNAseq workflows.
Affiliation(s)
- Sivan Leviyang
- Department of Mathematics and Statistics, Georgetown University, Washington, 20057, DC, USA.
2
Nahman O, Few-Cooper TJ, Shen-Orr SS. Cell-specific priors rescue differential gene expression in spatial spot-based technologies. Brief Bioinform 2024; 26:bbae621. [PMID: 39679437 DOI: 10.1093/bib/bbae621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Revised: 10/18/2024] [Accepted: 11/14/2024] [Indexed: 12/17/2024] Open
Abstract
Spatial transcriptomics (ST), a breakthrough technology, captures the complex structure and state of tissues through the spatial profiling of gene expression. A variety of ST technologies have now emerged, most prominently spot-based platforms such as Visium. Despite the widespread use of ST and its distinct data characteristics, the vast majority of studies continue to analyze ST data using algorithms originally designed for older technologies such as single-cell (SC) and bulk RNA-seq, particularly when identifying differentially expressed genes (DEGs). However, it remains unclear whether these algorithms are still valid or appropriate for ST data. Therefore, here, we sought to characterize the performance of these methods by constructing an in silico simulator of ST data with a controllable and known DEG ground truth. Surprisingly, our findings reveal little variation in the performance of classic DEG algorithms, all of which fail to accurately recapture known DEGs to significant levels. We further demonstrate that cellular heterogeneity within spots is a primary cause of this poor performance and propose a simple gene-selection scheme, based on prior knowledge of cell-type specificity, to overcome this. Notably, our approach outperforms existing data-driven methods designed specifically for ST data and offers improved DEG recovery and reliability rates. In summary, our work details a conceptual framework that can be used upstream, agnostically, of any DEG algorithm to improve the accuracy of ST analysis and any downstream findings.
Affiliation(s)
- Ornit Nahman
- Department of Immunology, Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, 1 Efron St., Haifa, 3525433, Israel
- Timothy J Few-Cooper
- Department of Immunology, Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, 1 Efron St., Haifa, 3525433, Israel
- Shai S Shen-Orr
- Department of Immunology, Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, 1 Efron St., Haifa, 3525433, Israel
3
Moon Y, Herrmann CJ, Mironov A, Zavolan M. PolyASite v3.0: a multi-species atlas of polyadenylation sites inferred from single-cell RNA-sequencing data. Nucleic Acids Res 2024:gkae1043. [PMID: 39530237 DOI: 10.1093/nar/gkae1043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2024] [Revised: 10/11/2024] [Accepted: 10/18/2024] [Indexed: 11/16/2024] Open
Abstract
The broadly used 10X Genomics technology for single-cell RNA sequencing (scRNA-seq) captures RNA 3' ends. Thus, some reads contain part of the non-templated polyadenosine tails, providing direct evidence for the sites of 3' end cleavage and polyadenylation on the respective RNAs. Taking advantage of this property, we recently developed the SCINPAS workflow to infer polyadenylation sites (PASs) from scRNA-seq data. Here, we used this workflow to construct version 3.0 (v3.0, https://polyasite.unibas.ch/) of the PolyASite Atlas from a big compendium of publicly available human, mouse and worm scRNA-seq datasets obtained from healthy tissues. As the resolution of scRNA-seq was too low for robust detection of cell-level differences in PAS usage, we aggregated samples based on their tissue-of-origin to construct tissue-level catalogs of PASs. These provide qualitatively new information about PAS usage, in comparison to the previous PAS catalogs that were based on bulk 3' end sequencing experiments primarily in cell lines. In the new version, we document stringency levels associated with each PAS so that users can balance sensitivity and specificity in their analysis. We also upgraded the integration with the UCSC Genome Browser and developed track hubs conveniently displaying pooled and tissue-specific expression of PASs.
Affiliation(s)
- Youngbin Moon
- Computational and Systems Biology, Biozentrum University of Basel, Spitalstrasse 41, CH-4056 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Quartier Sorge, Bâtiment Amphipôle, Vaud CH-1015, Switzerland
- Christina J Herrmann
- Computational and Systems Biology, Biozentrum University of Basel, Spitalstrasse 41, CH-4056 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Quartier Sorge, Bâtiment Amphipôle, Vaud CH-1015, Switzerland
- Aleksei Mironov
- Computational and Systems Biology, Biozentrum University of Basel, Spitalstrasse 41, CH-4056 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Quartier Sorge, Bâtiment Amphipôle, Vaud CH-1015, Switzerland
- Mihaela Zavolan
- Computational and Systems Biology, Biozentrum University of Basel, Spitalstrasse 41, CH-4056 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Quartier Sorge, Bâtiment Amphipôle, Vaud CH-1015, Switzerland
4
Leclerc E, Pachkov M, Morisseau L, Tokito F, Legallais C, Jellali R, Nishikawa M, Abderrahmani A, Sakai Y. Investigation of the motif activity of transcription regulators in pancreatic β-like cell subpopulations differentiated from human induced pluripotent stem cells. Mol Omics 2024. [PMID: 39494575 DOI: 10.1039/d4mo00082j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2024]
Abstract
Pancreatic β-cells are composed of different subtypes that play a key role in the control of insulin secretion and thereby control glucose homeostasis. In vitro differentiation of human induced pluripotent stem cells (hiPSCs) into 3D spheroids leads to the generation of β-cell subtypes and thus to the development of islet-like structures. Using this cutting-edge cell model, the aim of the study was to decipher the signaling signature that underlies β-cell subtypes, with a focus on the search for the activity of motifs of important transcription regulators (TRs). The investigation was performed using data from previous single-cell sequencing analysis introduced into the integrated system for motif activity response analysis (ISMARA) of transcription regulators. We extracted the matrix of important TRs activated in the β-cell subpopulation and bi-hormonal-like β-cells. Based on these TRs and their targets, we built specific regulatory networks for main cell subpopulations. Our data confirmed the transcriptomic heterogeneity of the β-cell subtype lineage and suggested a mechanism that could account for the differentiation of β-cell subtypes during pancreas development. We do believe that our findings could be instrumental for understanding the mechanisms that affect the balance of β-cell subtypes, leading to impaired insulin secretion in type 2 diabetes.
Affiliation(s)
- Eric Leclerc
- CNRS IRL 2820; Laboratory for Integrated Micro Mechatronic Systems, Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba; Meguro-ku, Tokyo, 153-8505, Japan
- Mikhail Pachkov
- Biozentrum, University of Basel, Spitalstrasse 41, 4056 Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, 1015 Lausanne, Switzerland
- Lisa Morisseau
- CNRS UMR 7338, Laboratoire de Biomécanique et Bioingénierie, Sorbonne universités, Université de Technologies de Compiègne, France
- Fumiya Tokito
- Department of Chemical System Engineering, Graduate School of Engineering, the University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
- Cecile Legallais
- CNRS UMR 7338, Laboratoire de Biomécanique et Bioingénierie, Sorbonne universités, Université de Technologies de Compiègne, France
- Rachid Jellali
- CNRS UMR 7338, Laboratoire de Biomécanique et Bioingénierie, Sorbonne universités, Université de Technologies de Compiègne, France
- Masaki Nishikawa
- Department of Chemical System Engineering, Graduate School of Engineering, the University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
- Amar Abderrahmani
- Univ. Lille, CNRS, Centrale Lille, Univ. Polytechnique Hauts-de-France, UMR 8520, IEMN, F-59000 Lille, France
- Yasuyuki Sakai
- CNRS IRL 2820; Laboratory for Integrated Micro Mechatronic Systems, Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba; Meguro-ku, Tokyo, 153-8505, Japan
- Department of Chemical System Engineering, Graduate School of Engineering, the University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
5
Wang W, Wang Y, Lyu R, Grün D. Scalable identification of lineage-specific gene regulatory networks from metacells with NetID. Genome Biol 2024; 25:275. [PMID: 39425176 PMCID: PMC11488259 DOI: 10.1186/s13059-024-03418-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 10/08/2024] [Indexed: 10/21/2024] Open
Abstract
The identification of gene regulatory networks (GRNs) is crucial for understanding cellular differentiation. Single-cell RNA sequencing data encode gene-level covariations at high resolution, yet data sparsity and high dimensionality hamper accurate and scalable GRN reconstruction. To overcome these challenges, we introduce NetID leveraging homogenous metacells while avoiding spurious gene-gene correlations. Benchmarking demonstrates superior performance of NetID compared to imputation-based methods. By incorporating cell fate probability information, NetID facilitates the prediction of lineage-specific GRNs and recovers known network motifs governing bone marrow hematopoiesis, making it a powerful toolkit for deciphering gene regulatory control of cellular differentiation from large-scale single-cell transcriptome data.
Affiliation(s)
- Weixu Wang
- Human Phenome Institute, Fudan University, Shanghai, China
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- Yichen Wang
- Cancer, Ageing and Somatic Mutation, Wellcome Sanger Institute, Hinxton, UK
- Ruiqi Lyu
- School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
- Dominic Grün
- Würzburg Institute of Systems Immunology, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.
- CAIDAS - Center for Artificial Intelligence and Data Science, Würzburg, Germany.
6
Skinner DJ, Lemaire P, Mani M. Physical modeling of embryonic transcriptomes identifies collective modes of gene expression. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.26.605398. [PMID: 39131269 PMCID: PMC11312445 DOI: 10.1101/2024.07.26.605398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/13/2024]
Abstract
Starting from one totipotent cell, complex multicellular organisms form through a series of differentiation and morphogenetic events, culminating in a multitude of cell types arranged in a functional and intricate spatial pattern. To do so, cells coordinate with each other, resulting in dynamics which follow a precise developmental trajectory, constraining the space of possible embryo-to-embryo variation. Using recent single-cell sequencing data of early ascidian embryos, we leverage natural variation together with modeling and inference techniques from statistical physics to investigate development at the level of a complete interconnected embryo - an embryonic transcriptome. After developing a robust and biophysically motivated approach to identifying distinct transcriptomic states or cell types, a statistical analysis reveals correlations within embryos and across cell types demonstrating the presence of collective variation. From these intra-embryo correlations, we infer minimal networks of cell-cell interactions, which reveal the collective modes of gene expression. Our work demonstrates how the existence and nature of spatial interactions along with the collective modes of expression that they give rise to can be inferred from single-cell gene expression measurements, opening up a wider range of biological questions that can be addressed using sequencing-based modalities.
Affiliation(s)
- Dominic J. Skinner
- Center for Computational Biology, Flatiron Institute, 162 5th Ave, New York, NY 10010, USA
- NSF-Simons Center for Quantitative Biology, Northwestern University, 2205 Tech Drive, Evanston, IL 60208, USA
- Patrick Lemaire
- CRBM, Université de Montpellier, CNRS, 34293 Montpellier, France
- Madhav Mani
- NSF-Simons Center for Quantitative Biology, Northwestern University, 2205 Tech Drive, Evanston, IL 60208, USA
- Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60208, USA
7
Lause J, Ziegenhain C, Hartmanis L, Berens P, Kobak D. Compound models and Pearson residuals for single-cell RNA-seq data without UMIs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.08.02.551637. [PMID: 37577688 PMCID: PMC10418209 DOI: 10.1101/2023.08.02.551637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Recent work employed Pearson residuals from Poisson or negative binomial models to normalize UMI data. To extend this approach to non-UMI data, we model the additional amplification step with a compound distribution: we assume that sequenced RNA molecules follow a negative binomial distribution, and are then replicated following an amplification distribution. We show how this model leads to compound Pearson residuals, which yield meaningful gene selection and embeddings of Smart-seq2 datasets. Further, we suggest that amplification distributions across several sequencing protocols can be described by a broken power law. The resulting compound model captures previously unexplained overdispersion and zero-inflation patterns in non-UMI data.
8
Sun F, Li H, Sun D, Fu S, Gu L, Shao X, Wang Q, Dong X, Duan B, Xing F, Wu J, Xiao M, Zhao F, Han JDJ, Liu Q, Fan X, Li C, Wang C, Shi T. Single-cell omics: experimental workflow, data analyses and applications. SCIENCE CHINA. LIFE SCIENCES 2024:10.1007/s11427-023-2561-0. [PMID: 39060615 DOI: 10.1007/s11427-023-2561-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 04/18/2024] [Indexed: 07/28/2024]
Abstract
Cells are the fundamental units of biological systems and exhibit unique development trajectories and molecular features. Our exploration of how the genomes orchestrate the formation and maintenance of each cell, and control the cellular phenotypes of various organisms, is both captivating and intricate. Since the inception of the first single-cell RNA technology, technologies related to single-cell sequencing have experienced rapid advancements in recent years. These technologies have expanded horizontally to include single-cell genome, epigenome, proteome, and metabolome, while vertically, they have progressed to integrate multiple omics data and incorporate additional information such as spatial scRNA-seq and CRISPR screening. Single-cell omics represent a groundbreaking advancement in the biomedical field, offering profound insights into the understanding of complex diseases, including cancers. Here, we comprehensively summarize recent advances in single-cell omics technologies, with a specific focus on the methodology section. This overview aims to guide researchers in selecting appropriate methods for single-cell sequencing and related data analysis.
Affiliation(s)
- Fengying Sun
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
- Haoyan Li
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- Dongqing Sun
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Shaliu Fu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Lei Gu
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Xin Shao
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China
- Qinqin Wang
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Xin Dong
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Bin Duan
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Feiyang Xing
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Jun Wu
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Minmin Xiao
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
- Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China.
- Jing-Dong J Han
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China.
- Qi Liu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China.
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China.
- Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China.
- Zhejiang Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310006, China.
- Chen Li
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
- Chenfei Wang
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China.
- Tieliu Shi
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, 200062, China.
9
Gondal MN, Shah SUR, Chinnaiyan AM, Cieslik M. A systematic overview of single-cell transcriptomics databases, their use cases, and limitations. FRONTIERS IN BIOINFORMATICS 2024; 4:1417428. [PMID: 39040140 PMCID: PMC11260681 DOI: 10.3389/fbinf.2024.1417428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2024] [Accepted: 06/11/2024] [Indexed: 07/24/2024] Open
Abstract
Rapid advancements in high-throughput single-cell RNA-seq (scRNA-seq) technologies and experimental protocols have led to the generation of vast amounts of transcriptomic data that populates several online databases and repositories. Here, we systematically examined large-scale scRNA-seq databases, categorizing them based on their scope and purpose such as general, tissue-specific databases, disease-specific databases, cancer-focused databases, and cell type-focused databases. Next, we discuss the technical and methodological challenges associated with curating large-scale scRNA-seq databases, along with current computational solutions. We argue that understanding scRNA-seq databases, including their limitations and assumptions, is crucial for effectively utilizing this data to make robust discoveries and identify novel biological insights. Such platforms can help bridge the gap between computational and wet lab scientists through user-friendly web-based interfaces needed for democratizing access to single-cell data. These platforms would facilitate interdisciplinary research, enabling researchers from various disciplines to collaborate effectively. This review underscores the importance of leveraging computational approaches to unravel the complexities of single-cell data and offers a promising direction for future research in the field.
Affiliation(s)
- Mahnoor N. Gondal
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States
- Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, MI, United States
- Saad Ur Rehman Shah
- Gies College of Business, University of Illinois Business College, Champaign, IL, United States
- Arul M. Chinnaiyan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States
- Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, MI, United States
- Department of Pathology, University of Michigan, Ann Arbor, MI, United States
- Department of Urology, University of Michigan, Ann Arbor, MI, United States
- Howard Hughes Medical Institute, Ann Arbor, MI, United States
- University of Michigan Rogel Cancer Center, Ann Arbor, MI, United States
- Marcin Cieslik
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States
- Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, MI, United States
- Department of Pathology, University of Michigan, Ann Arbor, MI, United States
- University of Michigan Rogel Cancer Center, Ann Arbor, MI, United States
10
Schupp PG, Shelton SJ, Brody DJ, Eliscu R, Johnson BE, Mazor T, Kelley KW, Potts MB, McDermott MW, Huang EJ, Lim DA, Pieper RO, Berger MS, Costello JF, Phillips JJ, Oldham MC. Deconstructing Intratumoral Heterogeneity through Multiomic and Multiscale Analysis of Serial Sections. Cancers (Basel) 2024; 16:2429. [PMID: 39001492 PMCID: PMC11240479 DOI: 10.3390/cancers16132429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2024] [Revised: 06/27/2024] [Accepted: 06/28/2024] [Indexed: 07/16/2024] Open
Abstract
Tumors may contain billions of cells, including distinct malignant clones and nonmalignant cell types. Clarifying the evolutionary histories, prevalence, and defining molecular features of these cells is essential for improving clinical outcomes, since intratumoral heterogeneity provides fuel for acquired resistance to targeted therapies. Here we present a statistically motivated strategy for deconstructing intratumoral heterogeneity through multiomic and multiscale analysis of serial tumor sections (MOMA). By combining deep sampling of IDH-mutant astrocytomas with integrative analysis of single-nucleotide variants, copy-number variants, and gene expression, we reconstruct and validate the phylogenies, spatial distributions, and transcriptional profiles of distinct malignant clones. By genotyping nuclei analyzed by single-nucleus RNA-seq for truncal mutations, we further show that commonly used algorithms for identifying cancer cells from single-cell transcriptomes may be inaccurate. We also demonstrate that correlating gene expression with tumor purity in bulk samples can reveal optimal markers of malignant cells and use this approach to identify a core set of genes that are consistently expressed by astrocytoma truncal clones, including AKR1C3, whose expression is associated with poor outcomes in several types of cancer. In summary, MOMA provides a robust and flexible strategy for precisely deconstructing intratumoral heterogeneity and clarifying the core molecular properties of distinct cellular populations in solid tumors.
Affiliation(s)
- Patrick G. Schupp
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA 94143, USA
- Samuel J. Shelton
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Daniel J. Brody
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Rebecca Eliscu
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Brett E. Johnson
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Tali Mazor
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Kevin W. Kelley
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Medical Scientist Training Program, University of California, San Francisco, San Francisco, CA 94143, USA
- Neuroscience Graduate Program, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Matthew B. Potts
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| | - Michael W. McDermott
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| | - Eric J. Huang
- Department of Pathology, University of California, San Francisco, San Francisco, CA 94143, USA;
| | - Daniel A. Lim
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| | - Russell O. Pieper
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| | - Mitchel S. Berger
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| | - Joseph F. Costello
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| | - Joanna J. Phillips
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
- Department of Pathology, University of California, San Francisco, San Francisco, CA 94143, USA;
| | - Michael C. Oldham
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA 94143, USA; (P.G.S.); (S.J.S.); (D.J.B.); (R.E.); (B.E.J.); (T.M.); (K.W.K.); (M.B.P.); (M.W.M.); (D.A.L.); (R.O.P.); (M.S.B.); (J.F.C.); (J.J.P.)
| |
|
11
|
Schupp PG, Shelton SJ, Brody DJ, Eliscu R, Johnson BE, Mazor T, Kelley KW, Potts MB, McDermott MW, Huang EJ, Lim DA, Pieper RO, Berger MS, Costello JF, Phillips JJ, Oldham MC. Deconstructing intratumoral heterogeneity through multiomic and multiscale analysis of serial sections. bioRxiv 2024:2023.06.21.545365. [PMID: 37645893 PMCID: PMC10461981 DOI: 10.1101/2023.06.21.545365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Tumors may contain billions of cells including distinct malignant clones and nonmalignant cell types. Clarifying the evolutionary histories, prevalence, and defining molecular features of these cells is essential for improving clinical outcomes, since intratumoral heterogeneity provides fuel for acquired resistance to targeted therapies. Here we present a statistically motivated strategy for deconstructing intratumoral heterogeneity through multiomic and multiscale analysis of serial tumor sections (MOMA). By combining deep sampling of IDH-mutant astrocytomas with integrative analysis of single-nucleotide variants, copy-number variants, and gene expression, we reconstruct and validate the phylogenies, spatial distributions, and transcriptional profiles of distinct malignant clones. By genotyping nuclei analyzed by single-nucleus RNA-seq for truncal mutations, we further show that commonly used algorithms for identifying cancer cells from single-cell transcriptomes may be inaccurate. We also demonstrate that correlating gene expression with tumor purity in bulk samples can reveal optimal markers of malignant cells and use this approach to identify a core set of genes that is consistently expressed by astrocytoma truncal clones, including AKR1C3, whose expression is associated with poor outcomes in several types of cancer. In summary, MOMA provides a robust and flexible strategy for precisely deconstructing intratumoral heterogeneity and clarifying the core molecular properties of distinct cellular populations in solid tumors.
Affiliation(s)
- Patrick G. Schupp
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
- Biomedical Sciences Graduate Program, University of California San Francisco, San Francisco, California, USA
| | - Samuel J. Shelton
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Daniel J. Brody
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Rebecca Eliscu
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Brett E. Johnson
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Tali Mazor
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
- Biomedical Sciences Graduate Program, University of California San Francisco, San Francisco, California, USA
- Medical Scientist Training Program and Neuroscience Graduate Program, University of California San Francisco, San Francisco, California, USA
| | - Kevin W. Kelley
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
- Medical Scientist Training Program and Neuroscience Graduate Program, University of California San Francisco, San Francisco, California, USA
- Neuroscience Graduate Program, University of California San Francisco, San Francisco, California, USA
| | - Matthew B. Potts
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Michael W. McDermott
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Eric J. Huang
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Daniel A. Lim
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Russell O. Pieper
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Mitchel S. Berger
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Joseph F. Costello
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| | - Joanna J. Phillips
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
- Department of Pathology, University of California, San Francisco, San Francisco, California, USA
| | - Michael C. Oldham
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
| |
|
12
|
Fourneaux C, Racine L, Koering C, Dussurgey S, Vallin E, Moussy A, Parmentier R, Brunard F, Stockholm D, Modolo L, Picard F, Gandrillon O, Paldi A, Gonin-Giraud S. Differentiation is accompanied by a progressive loss in transcriptional memory. BMC Biol 2024; 22:58. [PMID: 38468285 PMCID: PMC10929117 DOI: 10.1186/s12915-024-01846-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 02/13/2024] [Indexed: 03/13/2024] Open
Abstract
BACKGROUND Cell differentiation requires the integration of two opposite processes, a stabilizing cellular memory, especially at the transcriptional scale, and a burst of gene expression variability which follows the differentiation induction. Therefore, the actual capacity of a cell to undergo phenotypic change during a differentiation process relies upon a modification in this balance which favors change-inducing gene expression variability. However, there are no experimental data providing insight on how fast the transcriptomes of identical cells would diverge on the scale of the very first two cell divisions during the differentiation process. RESULTS In order to quantitatively address this question, we developed different experimental methods to recover the transcriptomes of related cells, after one and two divisions, while preserving the information about their lineage at the scale of a single cell division. We analyzed the transcriptomes of related cells from two differentiation biological systems (human CD34+ cells and T2EC chicken primary erythrocytic progenitors) using two different single-cell transcriptomics technologies (scRT-qPCR and scRNA-seq). CONCLUSIONS We identified that the gene transcription profiles of differentiating sister cells are more similar to each other than to those of non-related cells of the same type, sharing the same environment and undergoing similar biological processes. More importantly, we observed greater discrepancies between differentiating sister cells than between self-renewing sister cells. Furthermore, a progressive increase in this divergence from first generation to second generation was observed when comparing differentiating cousin cells to self-renewing cousin cells. Our results are in favor of a gradual erasure of transcriptional memory during the differentiation process.
Affiliation(s)
- Camille Fourneaux
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
| | - Laëtitia Racine
- Ecole Pratique des Hautes Etudes, PSL Research University, Sorbonne Université, INSERM, CRSA, Paris, 75012, France
| | - Catherine Koering
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
| | - Sébastien Dussurgey
- Plateforme AniRA-Cytométrie, Université Claude Bernard Lyon 1, CNRS UAR3444, Inserm US8, ENS de Lyon, SFR Biosciences, Lyon, F-69007, France
| | - Elodie Vallin
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
| | - Alice Moussy
- Ecole Pratique des Hautes Etudes, PSL Research University, Sorbonne Université, INSERM, CRSA, Paris, 75012, France
| | - Romuald Parmentier
- Ecole Pratique des Hautes Etudes, PSL Research University, Sorbonne Université, INSERM, CRSA, Paris, 75012, France
| | - Fanny Brunard
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
| | - Daniel Stockholm
- Ecole Pratique des Hautes Etudes, PSL Research University, Sorbonne Université, INSERM, CRSA, Paris, 75012, France
| | - Laurent Modolo
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
| | - Franck Picard
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
| | - Olivier Gandrillon
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
- Inria Center, Grenoble Rhone-Alpes, Equipe Dracula, Villeurbanne, F69100, France
| | - Andras Paldi
- Ecole Pratique des Hautes Etudes, PSL Research University, Sorbonne Université, INSERM, CRSA, Paris, 75012, France
| | - Sandrine Gonin-Giraud
- Laboratoire de Biologie et Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France.
| |
|
13
|
Aslan Kamil M, Fourneaux C, Yilmaz A, Stavros S, Parmentier R, Paldi A, Gonin-Giraud S, deMello AJ, Gandrillon O. An image-guided microfluidic system for single-cell lineage tracking. PLoS One 2023; 18:e0288655. [PMID: 37527253 PMCID: PMC10393162 DOI: 10.1371/journal.pone.0288655] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/04/2023] [Accepted: 06/30/2023] [Indexed: 08/03/2023] Open
Abstract
Cell lineage tracking is a long-standing and unresolved problem in biology. Microfluidic technologies have the potential to address this problem, by virtue of their ability to manipulate and process single cells in a rapid, controllable and efficient manner. Indeed, when coupled with traditional imaging approaches, microfluidic systems allow the experimentalist to follow single-cell divisions over time. Herein, we present a valve-based microfluidic system able to probe the decision-making processes of single cells, by tracking their lineage over multiple generations. The system operates by trapping single cells within growth chambers, allowing the trapped cells to grow and divide, isolating sister cells after a user-defined number of divisions and finally extracting them for downstream transcriptome analysis. The platform incorporates multiple cell manipulation operations, image processing-based automation for cell loading and growth monitoring, reagent addition and device washing. To demonstrate the efficacy of the microfluidic workflow, 6C2 (chicken erythroleukemia) and T2EC (primary chicken erythrocytic progenitors) cells are tracked inside the microfluidic device over two generations, with a cell viability rate in excess of 90%. Sister cells are successfully isolated after division and extracted within a 500 nL volume, which was demonstrated to be compatible with downstream single-cell RNA sequencing analysis.
Affiliation(s)
- Mahmut Aslan Kamil
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Zürich, Switzerland
| | - Camille Fourneaux
- Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard, Lyon, France
| | | | - Stavrakis Stavros
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Zürich, Switzerland
| | - Romuald Parmentier
- Ecole Pratique des Hautes Etudes, St-Antoine Research Center, Inserm U938, PSL Research University, Paris, France
| | - Andras Paldi
- Ecole Pratique des Hautes Etudes, St-Antoine Research Center, Inserm U938, PSL Research University, Paris, France
| | - Sandrine Gonin-Giraud
- Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard, Lyon, France
| | - Andrew J deMello
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Zürich, Switzerland
| | - Olivier Gandrillon
- Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard, Lyon, France
- Inria, France
| |
|
14
|
Xiong G, Bekiranov S, Zhang A. ProtoCell4P: an explainable prototype-based neural network for patient classification using single-cell RNA-seq. Bioinformatics 2023; 39:btad493. [PMID: 37540223 PMCID: PMC10444962 DOI: 10.1093/bioinformatics/btad493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 07/09/2023] [Accepted: 08/03/2023] [Indexed: 08/05/2023] Open
Abstract
MOTIVATION The rapid advance in single-cell RNA sequencing (scRNA-seq) technology over the past decade has provided a rich resource of gene expression profiles of single cells measured on patients, facilitating the study of many biological questions at the single-cell level. One intriguing research direction is to study the single cells that play critical roles in the phenotypes of patients, which has the potential to identify the cells and genes driving the disease phenotypes. To this end, deep learning models are expected to encode the single-cell information well and achieve precise prediction of patients' phenotypes using scRNA-seq data. However, designing deep learning models for classifying patient samples faces critical challenges because (i) the samples collected in the same dataset contain a variable number of cells (some samples might have only hundreds of cells sequenced, while others could have thousands), and (ii) the number of samples available is typically small and the expression profile of each cell is noisy and extremely high-dimensional. Moreover, the black-box nature of existing deep learning models makes it difficult for researchers to interpret the models and extract useful knowledge from them. RESULTS We propose a prototype-based and cell-informed model for patient phenotype classification, termed ProtoCell4P, that can alleviate the problems of sample scarcity and the varying number of cells by leveraging cell knowledge through representatives of cells (called prototypes), and can precisely classify patients by adaptively incorporating information from different cells. Moreover, this classification process can be explicitly interpreted by identifying the key cells for decision making and by further summarizing the knowledge of cell types to unravel the biological nature of the classification. Our approach is explainable at single-cell resolution and can identify the key cells in each patient's classification. The experimental results demonstrate that our proposed method can effectively deal with patient classifications using single-cell data and outperforms the existing approaches. Furthermore, our approach is able to uncover the association between cell types and biological classes of interest from a data-driven perspective. AVAILABILITY AND IMPLEMENTATION https://github.com/Teddy-XiongGZ/ProtoCell4P.
Affiliation(s)
- Guangzhi Xiong
- Department of Computer Science, University of Virginia, Charlottesville, VA, United States
| | - Stefan Bekiranov
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
| | - Aidong Zhang
- Department of Computer Science, University of Virginia, Charlottesville, VA, United States
| |
|
15
|
Jiménez S, Schreiber V, Mercier R, Gradwohl G, Molina N. Characterization of cell-fate decision landscapes by estimating transcription factor dynamics. Cell Rep Methods 2023; 3:100512. [PMID: 37533652 PMCID: PMC10391345 DOI: 10.1016/j.crmeth.2023.100512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/06/2022] [Revised: 03/23/2023] [Accepted: 06/01/2023] [Indexed: 08/04/2023]
Abstract
Time-specific modulation of gene expression during differentiation by transcription factors promotes cell diversity. However, estimating their dynamic regulatory activity at the single-cell level and in a high-throughput manner remains challenging. We present FateCompass, an integrative approach that utilizes single-cell transcriptomics data to identify lineage-specific transcription factors throughout differentiation. By combining a probabilistic framework with RNA velocities or differentiation potential, we estimate transition probabilities, while a linear model of gene regulation is employed to compute transcription factor activities. Considering dynamic changes and correlations of expression and activities, FateCompass identifies lineage-specific regulators. Our validation using in silico data and application to pancreatic endocrine cell differentiation datasets highlight both known and potentially novel lineage-specific regulators. Notably, we uncovered undescribed transcription factors of an enterochromaffin-like population during in vitro differentiation toward β-like cells. FateCompass provides a valuable framework for hypothesis generation, advancing our understanding of the gene regulatory networks driving cell-fate decisions.
Affiliation(s)
- Sara Jiménez
- Université de Strasbourg, Strasbourg, France
- CNRS, UMR 7104, 67400 Illkirch, France
- INSERM, UMR-S 1258, 67400 Illkirch, France
- IGBMC, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 67400 Illkirch, France
| | - Valérie Schreiber
- Université de Strasbourg, Strasbourg, France
- CNRS, UMR 7104, 67400 Illkirch, France
- INSERM, UMR-S 1258, 67400 Illkirch, France
- IGBMC, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 67400 Illkirch, France
| | - Reuben Mercier
- Université de Strasbourg, Strasbourg, France
- CNRS, UMR 7104, 67400 Illkirch, France
- INSERM, UMR-S 1258, 67400 Illkirch, France
- IGBMC, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 67400 Illkirch, France
| | - Gérard Gradwohl
- Université de Strasbourg, Strasbourg, France
- CNRS, UMR 7104, 67400 Illkirch, France
- INSERM, UMR-S 1258, 67400 Illkirch, France
- IGBMC, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 67400 Illkirch, France
| | - Nacho Molina
- Université de Strasbourg, Strasbourg, France
- CNRS, UMR 7104, 67400 Illkirch, France
- INSERM, UMR-S 1258, 67400 Illkirch, France
- IGBMC, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 67400 Illkirch, France
| |
|
16
|
Shahin M, Ji B, Dixit PD. EMBED: Essential MicroBiomE Dynamics, a dimensionality reduction approach for longitudinal microbiome studies. NPJ Syst Biol Appl 2023; 9:26. [PMID: 37339950 PMCID: PMC10282069 DOI: 10.1038/s41540-023-00285-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/03/2023] [Accepted: 05/23/2023] [Indexed: 06/22/2023] Open
Abstract
Dimensionality reduction offers unique insights into high-dimensional microbiome dynamics by leveraging collective abundance fluctuations of multiple bacteria driven by similar ecological perturbations. However, methods providing lower-dimensional representations of microbiome dynamics both at the community and individual taxa levels are not currently available. To that end, we present EMBED: Essential MicroBiomE Dynamics, a probabilistic nonlinear tensor factorization approach. Like normal mode analysis in structural biophysics, EMBED infers ecological normal modes (ECNs), which represent the unique orthogonal modes capturing the collective behavior of microbial communities. Using multiple real and synthetic datasets, we show that a very small number of ECNs can accurately approximate microbiome dynamics. Inferred ECNs reflect specific ecological behaviors, providing natural templates along which the dynamics of individual bacteria may be partitioned. Moreover, the multi-subject treatment in EMBED systematically identifies subject-specific and universal abundance dynamics that are not detected by traditional approaches. Collectively, these results highlight the utility of EMBED as a versatile dimensionality reduction tool for studies of microbiome dynamics.
Affiliation(s)
- Mayar Shahin
- Department of Physics, University of Florida, Gainesville, FL, 32611, USA.
| | - Brian Ji
- Physician-Scientist Training Pathway, Department of Medicine, UCSD, San Diego, CA, 92103, USA
| | - Purushottam D Dixit
- Department of Physics, University of Florida, Gainesville, FL, 32611, USA.
- Genetics Institute, University of Florida, Gainesville, FL, 32611, USA.
- Department of Chemical Engineering, University of Florida, Gainesville, FL, 32611, USA.
- Department of Biomedical Engineering, Yale University, New Haven, CT, 06511, USA.
| |
|
17
|
Adler M, Moriel N, Goeva A, Avraham-Davidi I, Mages S, Adams TS, Kaminski N, Macosko EZ, Regev A, Medzhitov R, Nitzan M. Emergence of division of labor in tissues through cell interactions and spatial cues. Cell Rep 2023; 42:112412. [PMID: 37086403 PMCID: PMC10242439 DOI: 10.1016/j.celrep.2023.112412] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 09/22/2022] [Revised: 01/26/2023] [Accepted: 04/03/2023] [Indexed: 04/23/2023] Open
Abstract
Most cell types in multicellular organisms can perform multiple functions. However, not all functions can be optimally performed simultaneously by the same cells. Functions incompatible at the level of individual cells can be performed at the cell population level, where cells divide labor and specialize in different functions. Division of labor can arise due to instruction by tissue environment or through self-organization. Here, we develop a computational framework to investigate the contribution of these mechanisms to division of labor within a cell-type population. By optimizing collective cellular task performance under trade-offs, we find that distinguishable expression patterns can emerge from cell-cell interactions versus instructive signals. We propose a method to construct ligand-receptor networks between specialist cells and use it to infer division-of-labor mechanisms from single-cell RNA sequencing (RNA-seq) and spatial transcriptomics data of stromal, epithelial, and immune cells. Our framework can be used to characterize the complexity of cell interactions within tissues.
Affiliation(s)
- Miri Adler
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA; Tananbaum Center for Theoretical and Analytical Human Biology, Yale University School of Medicine, New Haven, CT, USA
| | - Noa Moriel
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Aleksandrina Goeva
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA
| | - Inbal Avraham-Davidi
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA
| | - Simon Mages
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA; Gene Center and Department of Biochemistry, Ludwig-Maximilians-University Munich, Munich, Germany
| | - Taylor S Adams
- Section of Pulmonary, Critical Care and Sleep Medicine, Yale University School of Medicine, New Haven, CT, USA
| | - Naftali Kaminski
- Section of Pulmonary, Critical Care and Sleep Medicine, Yale University School of Medicine, New Haven, CT, USA
| | - Evan Z Macosko
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA; Massachusetts General Hospital, Department of Psychiatry, Boston, MA, USA
| | - Aviv Regev
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA; Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Ruslan Medzhitov
- Tananbaum Center for Theoretical and Analytical Human Biology, Yale University School of Medicine, New Haven, CT, USA; Howard Hughes Medical Institute, Department of Immunobiology, Yale University School of Medicine, New Haven, CT, USA.
| | - Mor Nitzan
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel; Racah Institute of Physics, The Hebrew University of Jerusalem, Jerusalem, Israel; Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel.
| |
|
18
|
Vo HD, Forero-Quintero LS, Aguilera LU, Munsky B. Analysis and design of single-cell experiments to harvest fluctuation information while rejecting measurement noise. Front Cell Dev Biol 2023; 11:1133994. [PMID: 37305680 PMCID: PMC10250612 DOI: 10.3389/fcell.2023.1133994] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/29/2022] [Accepted: 05/10/2023] [Indexed: 06/13/2023] Open
Abstract
Introduction: Despite continued technological improvements, measurement errors always reduce or distort the information that any real experiment can provide to quantify cellular dynamics. This problem is particularly serious for cell signaling studies to quantify heterogeneity in single-cell gene regulation, where important RNA and protein copy numbers are themselves subject to the inherently random fluctuations of biochemical reactions. Until now, it has not been clear how measurement noise should be managed in addition to other experiment design variables (e.g., sampling size, measurement times, or perturbation levels) to ensure that collected data will provide useful insights on signaling or gene expression mechanisms of interest. Methods: We propose a computational framework that takes explicit consideration of measurement errors to analyze single-cell observations, and we derive Fisher Information Matrix (FIM)-based criteria to quantify the information value of distorted experiments. Results and Discussion: We apply this framework to analyze multiple models in the context of simulated and experimental single-cell data for a reporter gene controlled by an HIV promoter. We show that the proposed approach quantitatively predicts how different types of measurement distortions affect the accuracy and precision of model identification, and we demonstrate that the effects of these distortions can be mitigated through explicit consideration during model inference. We conclude that this reformulation of the FIM could be used effectively to design single-cell experiments to optimally harvest fluctuation information while mitigating the effects of image distortion.
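As a rough illustration of how measurement distortion can lower Fisher information, consider a toy case that is an assumption for this note and not a model from the paper: the true copy number $X$ is Poisson with mean $\lambda$, and each molecule is detected independently with probability $p$ (both symbols are hypothetical). Binomial thinning of a Poisson variable is again Poisson, so the information about $\lambda$ carried by one observed cell shrinks by the factor $p$:

```latex
\begin{align*}
X &\sim \mathrm{Poisson}(\lambda), \qquad Y \mid X \sim \mathrm{Binomial}(X, p)
  \;\Longrightarrow\; Y \sim \mathrm{Poisson}(p\lambda), \\
\ell(\lambda; y) &= y \log(p\lambda) - p\lambda - \log y!, \\
\mathcal{I}_Y(\lambda) &= -\,\mathbb{E}\!\left[\frac{\partial^2 \ell}{\partial \lambda^2}\right]
  = \frac{\mathbb{E}[Y]}{\lambda^2} = \frac{p}{\lambda},
\qquad
\mathcal{I}_X(\lambda) = \frac{1}{\lambda}.
\end{align*}
```

Under this toy model, roughly $1/p$ times as many cells would be needed to reach the same precision on $\lambda$ as with undistorted measurements.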
Affiliation(s)
- Huy D. Vo
- Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO, United States
| | - Linda S. Forero-Quintero
- Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO, United States
| | - Luis U. Aguilera
- Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO, United States
| | - Brian Munsky
- Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO, United States
- School of Biomedical Engineering, Colorado State University, Fort Collins, CO, United States
| |
|
19
|
Ahlmann-Eltze C, Huber W. Comparison of transformations for single-cell RNA-seq data. Nat Methods 2023; 20:665-672. [PMID: 37037999 PMCID: PMC10172138 DOI: 10.1038/s41592-023-01814-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 08/25/2021] [Accepted: 02/11/2023] [Indexed: 04/12/2023]
Abstract
The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well as or better than the more sophisticated alternatives. This result highlights limitations of current theoretical analysis as assessed by bottom-line performance benchmarks.
Affiliation(s)
- Constantin Ahlmann-Eltze
  - Genome Biology Unit, EMBL, Heidelberg, Germany.
  - Faculty of Biosciences, Heidelberg University, Heidelberg, Germany.
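The "logarithm with a pseudo-count followed by principal-component analysis" mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' benchmark code: the function name `log_pca`, the depth-based size factors, the default pseudo-count of 1, and the use of a plain SVD for PCA are all assumptions made for the example, and every cell is assumed to have a nonzero total count.

```python
import numpy as np

def log_pca(counts, n_components=2, pseudo_count=1.0):
    """Hypothetical helper: log-transform a genes x cells count table, then run PCA."""
    counts = np.asarray(counts, dtype=float)
    # Per-cell size factors from sequencing depth (assumed normalization choice).
    depth = counts.sum(axis=0)
    size_factors = depth / depth.mean()
    # Shifted logarithm: log(y / s + pseudo_count).
    logged = np.log(counts / size_factors + pseudo_count)
    # Center each gene across cells before PCA.
    centered = logged - logged.mean(axis=1, keepdims=True)
    # PCA over cells via SVD of the cells x genes matrix.
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    # Rows are cells, columns are principal-component scores.
    return u[:, :n_components] * s[:n_components]

# Simulated Poisson counts stand in for real data here.
rng = np.random.default_rng(0)
example_counts = rng.poisson(5.0, size=(50, 20))
scores = log_pca(example_counts, n_components=2)
```

The returned array has one row per cell and can be passed to any generic clustering or nearest-neighbor method.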
20
Lazzardi S, Valle F, Mazzolini A, Scialdone A, Caselle M, Osella M. Emergent statistical laws in single-cell transcriptomic data. Phys Rev E 2023; 107:044403. [PMID: 37198814 DOI: 10.1103/physreve.107.044403] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2022] [Accepted: 03/24/2023] [Indexed: 05/19/2023]
Abstract
Large-scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology, or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, treatable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
Affiliation(s)
- Silvia Lazzardi
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Filippo Valle
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Andrea Mazzolini
  - Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université and Université de Paris, 75005 Paris, France
- Antonio Scialdone
  - Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Feodor-Lynen-Straße 21, 81377 München, Germany and Institute of Functional Epigenetics and Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
- Michele Caselle
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Matteo Osella
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
21
Islam MT, Xing L. Cartography of Genomic Interactions Enables Deep Analysis of Single-Cell Expression Data. Nat Commun 2023; 14:679. [PMID: 36755047 PMCID: PMC9908983 DOI: 10.1038/s41467-023-36383-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/19/2022] [Accepted: 01/30/2023] [Indexed: 02/10/2023]
Abstract
Remarkable advances in single cell genomics have presented unique challenges and opportunities for interrogating a wealth of biomedical inquiries. High dimensional genomic data are inherently complex because of intertwined relationships among the genes. Existing methods, including emerging deep learning-based approaches, do not consider the underlying biological characteristics during data processing, which greatly compromises the performance of data analysis and hinders the maximal utilization of state-of-the-art genomic techniques. In this work, we develop an entropy-based cartography strategy to contrive the high dimensional gene expression data into a configured image format, referred to as genomap, with explicit integration of the genomic interactions. This unique cartography casts the gene-gene interactions into the spatial configuration of genomaps and enables us to extract the deep genomic interaction features and discover underlying discriminative patterns of the data. We show that, for a wide variety of applications (cell clustering and recognition, gene signature extraction, single cell data integration, cellular trajectory analysis, dimensionality reduction, and visualization), the proposed approach drastically improves the accuracies of data analyses as compared to the state-of-the-art techniques.
Affiliation(s)
- Md Tauhidul Islam
  - Department of Radiation Oncology, Stanford University, Stanford, California, 94305, USA
- Lei Xing
  - Department of Radiation Oncology, Stanford University, Stanford, California, 94305, USA.
22
Tomkins M, Hoerbst F, Gupta S, Apelt F, Kehr J, Kragler F, Morris RJ. Exact Bayesian inference for the detection of graft-mobile transcripts from sequencing data. J R Soc Interface 2022; 19:20220644. [PMID: 36514890 PMCID: PMC9748499 DOI: 10.1098/rsif.2022.0644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Accepted: 11/24/2022] [Indexed: 12/15/2022]
Abstract
The long-distance transport of messenger RNAs (mRNAs) has been shown to be important for several developmental processes in plants. A popular method for identifying travelling mRNAs is to perform RNA-Seq on grafted plants. This approach depends on the ability to correctly assign sequenced mRNAs to the genetic background from which they originated. The assignment is often based on the identification of single-nucleotide polymorphisms (SNPs) between otherwise identical sequences. A major challenge is therefore to distinguish SNPs from sequencing errors. Here, we show how Bayes factors can be computed analytically using RNA-Seq data over all the SNPs in an mRNA. We used simulations to evaluate the performance of the proposed framework and demonstrate how Bayes factors accurately identify graft-mobile transcripts. The comparison with other detection methods using simulated data shows how not taking the variability in read depth, error rates and multiple SNPs per transcript into account can lead to incorrect classification. Our results suggest experimental design criteria for successful graft-mobile mRNA detection and show the pitfalls of filtering for sequencing errors or focusing on single SNPs within an mRNA.
Affiliation(s)
- Melissa Tomkins
  - Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR47UH, UK
- Franziska Hoerbst
  - Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR47UH, UK
- Saurabh Gupta
  - Max Planck Institute of Molecular Plant Physiology, Max Planck Institute, Am Mühlenberg 1, Potsdam-Golm 14476, Germany
- Federico Apelt
  - Max Planck Institute of Molecular Plant Physiology, Max Planck Institute, Am Mühlenberg 1, Potsdam-Golm 14476, Germany
- Julia Kehr
  - Institute of Plant Science and Microbiology, Universität Hamburg, Ohnhorststrasse 18, Hamburg 22609, Germany
- Friedrich Kragler
  - Max Planck Institute of Molecular Plant Physiology, Max Planck Institute, Am Mühlenberg 1, Potsdam-Golm 14476, Germany
- Richard J. Morris
  - Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR47UH, UK
23
Costa-Silva J, Domingues DS, Menotti D, Hungria M, Lopes FM. Temporal progress of gene expression analysis with RNA-Seq data: A review on the relationship between computational methods. Comput Struct Biotechnol J 2022; 21:86-98. [PMID: 36514333 PMCID: PMC9730150 DOI: 10.1016/j.csbj.2022.11.051] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/15/2022] [Revised: 11/25/2022] [Accepted: 11/25/2022] [Indexed: 12/03/2022]
Abstract
Analysis of differential gene expression from RNA-seq data has become a standard for several research areas. The steps for the computational analysis include many data types and file formats, and a wide variety of computational tools that can be applied alone or together as pipelines. This paper presents a review of the differential expression analysis pipeline, addressing its steps and the respective objectives, the principal methods available in each step, and their properties, therefore introducing an organized overview to this context. This review aims to address mainly the aspects involved in the differentially expressed gene (DEG) analysis from RNA sequencing data (RNA-seq), considering the computational methods. In addition, a timeline of the computational methods for DEG is shown and discussed, and the relationships existing between the most important computational tools are presented by an interaction network. A discussion on the challenges and gaps in DEG analysis is also highlighted in this review. This paper will serve as a tutorial for new entrants into the field and help established users update their analysis pipelines.
Affiliation(s)
- Juliana Costa-Silva
  - Department of Informatics – Federal University of Paraná, Rua Coronel Francisco Heráclito dos Santos, 100, 81531-990 Curitiba, Paraná, Brazil
- Douglas S. Domingues
  - Department of Genetics, “Luiz de Queiroz” College of Agriculture, University of São Paulo, Av. Pádua Dias, 11, 13418-900 Piracicaba, São Paulo, Brazil
- David Menotti
  - Department of Informatics – Federal University of Paraná, Rua Coronel Francisco Heráclito dos Santos, 100, 81531-990 Curitiba, Paraná, Brazil
- Mariangela Hungria
  - Department of Soil Biotechnology - Embrapa Soybean, Cx. Postal 231, 86000-970 Londrina, Paraná, Brazil
- Fabrício Martins Lopes
  - Department of Computer Science, Universidade Tecnológica Federal do Paraná – UTFPR, Av. Alberto Carazzai, 1640, 86300-000, Cornélio Procópio, Paraná, Brazil
24
Lan T, Hutvagner G, Zhang X, Liu T, Wong L, Li J. Density-based detection of cell transition states to construct disparate and bifurcating trajectories. Nucleic Acids Res 2022; 50:e122. [PMID: 36124665 PMCID: PMC9757071 DOI: 10.1093/nar/gkac785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Revised: 08/22/2022] [Accepted: 09/01/2022] [Indexed: 12/24/2022]
Abstract
Tree- and linear-shaped cell differentiation trajectories have been widely observed in developmental biology and can also be inferred through computational methods from single-cell RNA-sequencing datasets. However, trajectories with complicated topologies such as loops, disparate lineages and bifurcating hierarchy remain difficult to infer accurately. Here, we introduce a density-based trajectory inference method capable of constructing diverse shapes of topological patterns including the most intriguing bifurcations. The novelty of our method is a step to exploit overlapping probability distributions to identify transition states of cells for determining connectability between cell clusters, and another step to infer a stable trajectory through a base-topology guided iterative fitting. Our method precisely reconstructed various benchmark reference trajectories. As a case study to demonstrate practical usefulness, our method was tested on single-cell RNA sequencing profiles of blood cells of SARS-CoV-2-infected patients. We not only rediscovered the linear trajectory bridging the transition from IgM plasmablast cells to developing neutrophils, but also found a previously undiscovered lineage that can be rigorously supported by differentially expressed gene analysis.
Affiliation(s)
- Tian Lan
  - Data Science Institute and School of Computer Science, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Gyorgy Hutvagner
  - School of Biomedical Engineering, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Xuan Zhang
  - Data Science Institute and School of Computer Science, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Tao Liu
  - Children’s Cancer Institute Australia for Medical Research, Randwick, NSW 2031, Australia
- Limsoon Wong
  - School of Computing, National University of Singapore, 13 Computing Drive, 117417, Singapore
- Jinyan Li
  - Data Science Institute and School of Computer Science, University of Technology Sydney, Ultimo, NSW 2007, Australia
25
Qi J, Sheng Q, Zhou Y, Hua J, Xiao S, Jin S. scMTD: a statistical multidimensional imputation method for single-cell RNA-seq data leveraging transcriptome dynamic information. Cell Biosci 2022; 12:142. [PMID: 36056412 PMCID: PMC9440561 DOI: 10.1186/s13578-022-00886-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/30/2022] [Accepted: 08/17/2022] [Indexed: 11/17/2022]
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to capture transcriptomes at single-cell resolution. However, dropout events distort the gene expression levels and underlying biological signals, misleading the downstream analysis of scRNA-seq data. Results: We develop a statistical model-based multidimensional imputation algorithm, scMTD, that identifies local cell neighbors and specific gene co-expression networks based on the pseudo-time of cells, leveraging information on cell-level, gene-level, and transcriptome dynamic to recover scRNA-seq data. Compared with the state-of-the-art imputation methods through several real-data-based analytical experiments, scMTD effectively recovers biological signals of transcriptomes and consistently outperforms the other algorithms in improving FISH validation, trajectory inference, differential expression analysis, clustering analysis, and identification of cell types. Conclusions: scMTD maintains the gene expression characteristics, enhances the clustering of cell subpopulations, assists the study of gene expression dynamics, contributes to the discovery of rare cell types, and applies to both UMI-based and non-UMI-based data. Overall, scMTD’s reliability, applicability, and scalability make it a promising imputation approach for scRNA-seq data. Supplementary Information: The online version contains supplementary material available at 10.1186/s13578-022-00886-4.
26
Treppner M, Binder H, Hess M. Interpretable generative deep learning: an illustration with single cell gene expression data. Hum Genet 2022; 141:1481-1498. [PMID: 34988661 PMCID: PMC9360114 DOI: 10.1007/s00439-021-02417-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/14/2021] [Accepted: 08/06/2021] [Indexed: 11/26/2022]
Abstract
Deep generative models can learn the underlying structure, such as pathways or gene programs, from omics data. We provide an introduction as well as an overview of such techniques, specifically illustrating their use with single-cell gene expression data. For example, the low dimensional latent representations offered by various approaches, such as variational auto-encoders, are useful to get a better understanding of the relations between observed gene expressions and experimental factors or phenotypes. Furthermore, by providing a generative model for the latent and observed variables, deep generative models can generate synthetic observations, which allow us to assess the uncertainty in the learned representations. While deep generative models are useful to learn the structure of high-dimensional omics data by efficiently capturing non-linear dependencies between genes, they are sometimes difficult to interpret due to their neural network building blocks. More precisely, to understand the relationship between learned latent variables and observed variables, e.g., gene transcript abundances and external phenotypes, is difficult. Therefore, we also illustrate current approaches that allow us to infer the relationship between learned latent variables and observed variables as well as external phenotypes. Thereby, we render deep learning approaches more interpretable. In an application with single-cell gene expression data, we demonstrate the utility of the discussed methods.
Affiliation(s)
- Martin Treppner
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Str. 26, Freiburg, 79104, Germany.
- Harald Binder
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
- Moritz Hess
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
27
Gorin G, Fang M, Chari T, Pachter L. RNA velocity unraveled. PLoS Comput Biol 2022; 18:e1010492. [PMID: 36094956 PMCID: PMC9499228 DOI: 10.1371/journal.pcbi.1010492] [Citation(s) in RCA: 58] [Impact Index Per Article: 19.3] [Received: 03/17/2022] [Revised: 09/22/2022] [Accepted: 08/14/2022] [Indexed: 11/24/2022]
Abstract
We perform a thorough analysis of RNA velocity methods, with a view towards understanding the suitability of the various assumptions underlying popular implementations. In addition to providing a self-contained exposition of the underlying mathematics, we undertake simulations and perform controlled experiments on biological datasets to assess workflow sensitivity to parameter choices and underlying biology. Finally, we argue for a more rigorous approach to RNA velocity, and present a framework for Markovian analysis that points to directions for improvement and mitigation of current problems.
Affiliation(s)
- Gennady Gorin
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America
- Meichen Fang
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America
- Tara Chari
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, California, United States of America
28
Breda J, Banerjee A, Jayachandran R, Pieters J, Zavolan M. A novel approach to single-cell analysis reveals intrinsic differences in immune marker expression in unstimulated BALB/c and C57BL/6 macrophages. FEBS Lett 2022; 596:2630-2643. [PMID: 36001069 DOI: 10.1002/1873-3468.14478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2022] [Revised: 08/12/2022] [Accepted: 08/16/2022] [Indexed: 11/06/2022]
Abstract
The origin of functional heterogeneity among macrophages, key innate immune system components, is still debated. While mouse strains differ in their immune responses, the range of gene expression variation among their pre-stimulation macrophages is unknown. With a novel approach to scRNA-seq analysis, we reveal the gene expression variation in unstimulated macrophage populations from BALB/c and C57BL/6 mice. We show that intrinsic strain-to-strain differences are detectable before stimulation and we place the unstimulated single cells within the gene expression landscape of stimulated macrophages. C57BL/6 mice show stronger evidence of macrophage polarization than BALB/c mice, which may contribute to their relative resistance to pathogens. Our computational methods can be generally adopted to uncover biological variation between cell populations.
Affiliation(s)
- Jeremie Breda
  - Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Arka Banerjee
  - Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Jean Pieters
  - Biozentrum, University of Basel, Basel, Switzerland
- Mihaela Zavolan
  - Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
29
Lasri A, Shahrezaei V, Sturrock M. Benchmarking imputation methods for network inference using a novel method of synthetic scRNA-seq data generation. BMC Bioinformatics 2022; 23:236. [PMID: 35715748 PMCID: PMC9204969 DOI: 10.1186/s12859-022-04778-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2021] [Accepted: 05/31/2022] [Indexed: 11/30/2022]
Abstract
Background: Single cell RNA-sequencing (scRNA-seq) has very rapidly become the new workhorse of modern biology providing an unprecedented global view on cellular diversity and heterogeneity. In particular, the structure of gene-gene expression correlation contains information on the underlying gene regulatory networks. However, interpretation of scRNA-seq data is challenging due to specific experimental error and biases that are unique to this kind of data including drop-out (or technical zeros). Methods: To deal with this problem several methods for imputation of zeros for scRNA-seq have been developed. However, it is not clear how these processing steps affect inference of genetic networks from single cell data. Here, we introduce Biomodelling.jl, a tool for generation of synthetic scRNA-seq data using multiscale modelling of stochastic gene regulatory networks in growing and dividing cells. Results: Our tool produces realistic transcription data with a known ground truth network topology that can be used to benchmark different approaches for gene regulatory network inference. Using this tool we investigate the impact of different imputation methods on the performance of several network inference algorithms. Conclusions: Biomodelling.jl provides a versatile and useful tool for future development and benchmarking of network inference approaches using scRNA-seq data. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04778-9
Affiliation(s)
- Ayoub Lasri
  - Department of Physiology and Medical Physics, Royal College of Surgeons in Ireland, Dublin, Ireland
- Vahid Shahrezaei
  - Department of Mathematics, Faculty of Natural Sciences, Imperial College London, London, SW7 2AZ, UK
- Marc Sturrock
  - Department of Physiology and Medical Physics, Royal College of Surgeons in Ireland, Dublin, Ireland.
30
Abstract
High-throughput sequencing of the B cell receptor (BCR) repertoire provides useful insights into the adaptive immune system. With the continuous development of BCR-seq technology, many efforts have been made to develop methods for analyzing the ever-increasing BCR repertoire data. In this review, we comprehensively outline different BCR repertoire library preparation protocols and summarize three major steps of BCR-seq data analysis, i.e., V(D)J sequence annotation, clonal phylogenetic inference, and BCR repertoire profiling and mining. Unlike other reviews in this field, we emphasize the background intuition and the statistical principle of each method to help biologists better understand it. Finally, we discuss data mining problems for BCR-seq data, with a highlight on recently emerging multiple-sample analysis.
31
Are batch effects still relevant in the age of big data? Trends Biotechnol 2022; 40:1029-1040. [DOI: 10.1016/j.tibtech.2022.02.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/10/2021] [Revised: 02/13/2022] [Accepted: 02/18/2022] [Indexed: 12/30/2022]
32
Schwabe D, Falcke M. On the relation between input and output distributions of scRNA-seq experiments. Bioinformatics 2022; 38:1336-1343. [PMID: 34908126 DOI: 10.1093/bioinformatics/btab841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2021] [Revised: 12/01/2021] [Accepted: 12/12/2021] [Indexed: 01/05/2023]
Abstract
MOTIVATION: Single-cell RNA sequencing determines RNA copy numbers per cell for a given gene. However, technical noise raises the question of how observed distributions (output) are connected to their cellular distributions (input). RESULTS: We model a single-cell RNA sequencing setup consisting of PCR amplification and sequencing, and derive probability distribution functions for the output distribution given an input distribution. We provide copy number distributions arising from single transcripts during PCR amplification with exact expressions for mean and variance. We prove that the coefficient of variation of the output of sequencing is always larger than that of the input distribution. Experimental data reveal that the variance and mean of the input distribution obey characteristic relations, which we specifically determine for a HeLa dataset. We can calculate as many moments of the input distribution as are known of the output distribution (up to all). This, in principle, completely determines the input from the output distribution. AVAILABILITY AND IMPLEMENTATION: Source code freely available at https://github.com/danielschw188/InputOutputSCRNASeq. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Daniel Schwabe
  - Mathematical Cell Physiology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125 Berlin, Germany
- Martin Falcke
  - Mathematical Cell Physiology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125 Berlin, Germany; Department of Physics, Humboldt University Berlin, 12489 Berlin, Germany
33
Lopez-Delisle L, Delisle JB. baredSC: Bayesian approach to retrieve expression distribution of single-cell data. BMC Bioinformatics 2022; 23:36. [PMID: 35021985 PMCID: PMC8756634 DOI: 10.1186/s12859-021-04507-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2021] [Accepted: 11/30/2021] [Indexed: 12/02/2022]
Abstract
Background: The number of studies using single-cell RNA sequencing (scRNA-seq) is constantly growing. This powerful technique provides a sampling of the whole transcriptome of a cell. However, sparsity of the data can be a major hurdle when studying the distribution of the expression of a specific gene or the correlation between the expressions of two genes. Results: We show that the main technical noise associated with these scRNA-seq experiments is due to the sampling, i.e., Poisson noise. We present a new tool named baredSC, for Bayesian Approach to Retrieve Expression Distribution of Single-Cell data, which infers the intrinsic expression distribution in scRNA-seq data using a Gaussian mixture model. baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation in the two genes’ expressions. We apply baredSC to simulated scRNA-seq data and show that the algorithm is able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data. We also apply baredSC to two real biological data sets. First, we use it to measure the anti-correlation between Hoxd13 and Hoxa11, two genes with known genetic interaction in embryonic limb. Then, we study the expression of Pitx1 in embryonic hindlimb, for which a trimodal distribution has been identified through flow cytometry. While other methods to analyze scRNA-seq are too sensitive to sampling noise, baredSC reveals this trimodal distribution. Conclusion: baredSC is a powerful tool which aims at retrieving the expression distribution of few genes of interest from scRNA-seq data. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04507-8.
34
Burri D, Zavolan M. Shortening of 3' UTRs in most cell types composing tumor tissues implicates alternative polyadenylation in protein metabolism. RNA (New York, N.Y.) 2021; 27:1459-1470. [PMID: 34521731 PMCID: PMC8594477 DOI: 10.1261/rna.078886.121] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/30/2021] [Accepted: 08/24/2021] [Indexed: 05/18/2023]
Abstract
During pre-mRNA maturation, 3' end processing can occur at different polyadenylation sites in the 3' untranslated region (3' UTR) to give rise to transcript isoforms that differ in the length of their 3' UTRs. Longer 3' UTRs contain additional cis-regulatory elements that impact the fate of the transcript and/or of the resulting protein. Extensive alternative polyadenylation (APA) has been observed in cancers, but the mechanisms and roles remain elusive. In particular, it is unclear whether the APA occurs in the malignant cells or in other cell types that infiltrate the tumor. To resolve this, we developed a computational method, called SCUREL, that quantifies changes in 3' UTR length between groups of cells, including cells of the same type originating from tumor and control tissue. We used this method to study APA in human lung adenocarcinoma (LUAD). SCUREL relies solely on annotated 3' UTRs and, on control systems such as T cell activation and spermatogenesis, gives qualitatively similar results at much greater sensitivity compared to the previously published scAPA method. In the LUAD samples, we find a general trend toward 3' UTR shortening not only in cancer cells compared to the cell type of origin, but also when comparing other cell types from the tumor vs. the control tissue environment. However, we also find high variability in the individual targets between patients. The findings help in understanding the extent and impact of APA in LUAD, which may support improvements in diagnosis and treatment.
Affiliation(s)
- Dominik Burri
- Computational and Systems Biology, Biozentrum, University of Basel, Basel, CH-4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, CH-4056, Switzerland
- Mihaela Zavolan
- Computational and Systems Biology, Biozentrum, University of Basel, Basel, CH-4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, CH-4056, Switzerland
35
Lause J, Berens P, Kobak D. Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data. Genome Biol 2021; 22:258. [PMID: 34488842 PMCID: PMC8419999 DOI: 10.1186/s13059-021-02451-7] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Received: 12/01/2020] [Accepted: 08/02/2021] [Indexed: 11/18/2022]
Abstract
BACKGROUND: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. RESULTS: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth. CONCLUSIONS: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
Affiliation(s)
- Jan Lause
- University of Tübingen, Institute for Ophthalmic Research, Tübingen, Germany
- Philipp Berens
- University of Tübingen, Institute for Ophthalmic Research, Tübingen, Germany
- University of Tübingen, Institute for Bioinformatics and Medical Informatics, Tübingen, Germany
- University of Tübingen, Bernstein Center for Computational Neuroscience, Tübingen, Germany
- University of Tübingen, Center for Integrative Neuroscience, Tübingen, Germany
- Dmitry Kobak
- University of Tübingen, Institute for Ophthalmic Research, Tübingen, Germany