1
Crowell HL, Morillo Leonardo SX, Soneson C, Robinson MD. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol 2023; 24:62. [PMID: 36991470 PMCID: PMC10061781 DOI: 10.1186/s13059-023-02904-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/23/2021] [Accepted: 03/20/2023] [Indexed: 03/31/2023] Open
Abstract
BACKGROUND: With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, both on their own and in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard to make results credible and transferable to real data.
RESULTS: Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch and cluster level. Secondly, we investigated the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which quality control summaries can capture reference-simulation similarity, and to what extent.
CONCLUSIONS: Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, that they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and that it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Affiliation(s)
- Helena L Crowell
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Charlotte Soneson
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
  - Current address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Mark D Robinson
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
2
Laczik M, Erdős E, Ozgyin L, Hevessy Z, Csősz É, Kalló G, Nagy T, Barta E, Póliska S, Szatmári I, Bálint BL. Extensive proteome and functional genomic profiling of variability between genetically identical human B-lymphoblastoid cells. Sci Data 2022; 9:763. [PMID: 36496436 PMCID: PMC9741606 DOI: 10.1038/s41597-022-01871-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Accepted: 11/22/2022] [Indexed: 12/13/2022] Open
Abstract
In life-science research, isogenic B-lymphoblastoid cell lines (LCLs) are widely known and preferred for their genetic stability; they are often used, for example, to study mutations, where genetic stability is crucial. We have shown previously that phenotypic variability can be observed in isogenic B-lymphoblastoid cell lines. Isogenic LCLs present well-defined phenotypic differences on various levels, for example at the gene expression level or the chromatin level. Based on our investigations, the phenotypic variability of the isogenic LCLs is also accompanied by some genetic variation. We have developed a compendium of LCL datasets that present the phenotypic and genetic variability of five isogenic LCLs from a multiomic perspective. In this paper, we present additional datasets generated with Next Generation Sequencing techniques to provide genomic and transcriptomic profiles (WGS, RNA-seq, single-cell RNA-seq) and protein-DNA interactions (ChIP-seq), together with mass spectrometry and flow cytometry datasets to monitor changes in the proteome. We are sharing these datasets with the scientific community according to the FAIR principles for further investigations.
Affiliation(s)
- Miklós Laczik
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Edina Erdős
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Lilla Ozgyin
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Zsuzsanna Hevessy
  - Department of Laboratory Medicine, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Éva Csősz
  - Proteomics Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Gergő Kalló
  - Proteomics Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Tibor Nagy
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
  - Department of Genetics and Genomics, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert út 4, Gödöllő, H-2100 Hungary (GRID: grid.129553.9; ISNI: 0000 0001 1015 7851)
- Endre Barta
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
  - Department of Genetics and Genomics, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert út 4, Gödöllő, H-2100 Hungary (GRID: grid.129553.9; ISNI: 0000 0001 1015 7851)
- Szilárd Póliska
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- István Szatmári
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
  - Faculty of Pharmacy, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
- Bálint László Bálint
  - Genomic Medicine and Bioinformatic Core Facility, Department of Biochemistry and Molecular Biology, Faculty of Medicine, University of Debrecen, Debrecen, Egyetem tér 1., H-4032 Hungary (GRID: grid.7122.6; ISNI: 0000 0001 1088 8582)
  - Department of Bioinformatics, Semmelweis University, Budapest, Tűzoltó utca 7-9., H-1094 Hungary (GRID: grid.11804.3c; ISNI: 0000 0001 0942 9821)
3
Mallick H, Chatterjee S, Chowdhury S, Chatterjee S, Rahnavard A, Hicks SC. Differential expression of single-cell RNA-seq data using Tweedie models. Stat Med 2022; 41:3492-3510. [PMID: 35656596 PMCID: PMC9288986 DOI: 10.1002/sim.9430] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/25/2021] [Revised: 04/21/2022] [Accepted: 04/22/2022] [Indexed: 12/13/2022]
Abstract
The performance of computational methods and software to identify differentially expressed features in single-cell RNA-sequencing (scRNA-seq) has been shown to be influenced by several factors, including the choice of normalization method and the choice of experimental platform (or library preparation protocol) used to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on its own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows the zero probability mass to exceed that of a traditional Tweedie distribution, in order to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.
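For orientation, the Tweedie family named in this abstract is conventionally characterized by a power mean-variance relationship. The block below is only a sketch in standard textbook notation; the symbols $\mu$, $\phi$, $p$, $\lambda$, and $\pi$ are assumptions introduced here for illustration, and the paper's own parameterization may differ.

```latex
% Tweedie mean-variance relationship (mean \mu > 0, dispersion \phi > 0, power p)
\operatorname{Var}(Y) = \phi\,\mu^{p}

% For 1 < p < 2 the distribution is compound Poisson-gamma,
% so it places a positive point mass at zero:
P(Y = 0) = \exp(-\lambda), \qquad \lambda = \frac{\mu^{2-p}}{\phi\,(2-p)}

% A zero-inflated variant mixes in an extra zero probability \pi \in [0, 1]:
P_{\mathrm{ZI}}(Y = 0) = \pi + (1 - \pi)\exp(-\lambda)
```

Setting $\pi = 0$ recovers the ordinary Tweedie zero probability, while any $\pi > 0$ lets the zero mass exceed it, which is the behaviour the abstract describes for data with excess zeros.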
Affiliation(s)
- Himel Mallick
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA
- Suvo Chatterjee
  - Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Shrabanti Chowdhury
  - Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Saptarshi Chatterjee
  - Department of Statistics, Data and Analytics, Eli Lilly & Company, Indianapolis, IN 46225, USA
- Ali Rahnavard
  - Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Stephanie C. Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA