1
|
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. Leveraging gene correlations in single cell transcriptomic data. BMC Bioinformatics 2024; 25:305. [PMID: 39294560 PMCID: PMC11411778 DOI: 10.1186/s12859-024-05926-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 09/09/2024] [Indexed: 09/20/2024] Open
Abstract
BACKGROUND Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data-looking for rare cell types, subtleties of cell states, and details of gene regulatory networks-there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually). RESULTS We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization-a step that skews distributions, particularly for sparse data-and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. CONCLUSIONS New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.
Collapse
Affiliation(s)
- Kai Silkwood
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
| | - Emmanuel Dollinger
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Department of Mathematics, University of California, Irvine, Irvine, CA, USA
| | - Joshua Gervin
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
| | - Scott Atwood
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
| | - Qing Nie
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Department of Mathematics, University of California, Irvine, Irvine, CA, USA
| | - Arthur D Lander
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA.
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA.
| |
Collapse
|
2
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Collapse
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
| |
Collapse
|
3
|
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. Leveraging gene correlations in single cell transcriptomic data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.14.532643. [PMID: 36993765 PMCID: PMC10055147 DOI: 10.1101/2023.03.14.532643] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
BACKGROUND Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data-looking for rare cell types, subtleties of cell states, and details of gene regulatory networks-there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data when ground truth about biological variation is unknown (i.e., usually). RESULTS We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization-a step that skews distributions, particularly for sparse data-and calculate p-values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. CONCLUSIONS New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.
Collapse
Affiliation(s)
- Kai Silkwood
- Center for Complex Biological Systems, University of California, Irvine, Irvine CA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
| | - Emmanuel Dollinger
- Center for Complex Biological Systems, University of California, Irvine, Irvine CA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
- Department of Mathematics, University of California, Irvine, Irvine CA
| | - Josh Gervin
- Center for Complex Biological Systems, University of California, Irvine, Irvine CA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
| | - Scott Atwood
- Center for Complex Biological Systems, University of California, Irvine, Irvine CA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
| | - Qing Nie
- Center for Complex Biological Systems, University of California, Irvine, Irvine CA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
- Department of Mathematics, University of California, Irvine, Irvine CA
| | - Arthur D. Lander
- Center for Complex Biological Systems, University of California, Irvine, Irvine CA
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
| |
Collapse
|
4
|
Das S, Rai A, Rai SN. Differential Expression Analysis of Single-Cell RNA-Seq Data: Current Statistical Approaches and Outstanding Challenges. ENTROPY 2022; 24:e24070995. [PMID: 35885218 PMCID: PMC9315519 DOI: 10.3390/e24070995] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/20/2022] [Revised: 06/25/2022] [Accepted: 07/09/2022] [Indexed: 01/11/2023]
Abstract
With the advent of single-cell RNA-sequencing (scRNA-seq), it is possible to measure the expression dynamics of genes at the single-cell level. Through scRNA-seq, a huge amount of expression data for several thousand(s) of genes over million(s) of cells are generated in a single experiment. Differential expression analysis is the primary downstream analysis of such data to identify gene markers for cell type detection and also provide inputs to other secondary analyses. Many statistical approaches for differential expression analysis have been reported in the literature. Therefore, we critically discuss the underlying statistical principles of the approaches and distinctly divide them into six major classes, i.e., generalized linear, generalized additive, Hurdle, mixture models, two-class parametric, and non-parametric approaches. We also succinctly discuss the limitations that are specific to each class of approaches, and how they are addressed by other subsequent classes of approach. A number of challenges are identified in this study that must be addressed to develop the next class of innovative approaches. Furthermore, we also emphasize the methodological challenges involved in differential expression analysis of scRNA-seq data that researchers must address to draw maximum benefit from this recent single-cell technology. This study will serve as a guide to genome researchers and experimental biologists to objectively select options for their analysis.
Collapse
Affiliation(s)
- Samarendra Das
- ICAR-Directorate of Foot and Mouth Disease, Arugul, Bhubaneswar 752050, India
- International Centre for Foot and Mouth Disease, Arugul, Bhubaneswar 752050, India
- Correspondence: or (S.D.); (S.N.R.)
| | - Anil Rai
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India;
| | - Shesh N. Rai
- School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY 40292, USA
- Biostatistics and Bioinformatics Facility, Brown Cancer Center, University of Louisville, Louisville, KY 40202, USA
- Biostatisitcs and Informatics Facility, Center for Integrative Environmental Health Sciences, University of Louisville, Louisville, KY 40202, USA
- Data Analysis and Sample Management Facility, The University of Louisville Super Fund Center, University of Louisville, Louisville, KY 40202, USA
- Hepatobiology and Toxicology Center, University of Louisville, Louisville, KY 40202, USA
- Christina Lee Brown Envirome Institute, University of Louisville, Louisville, KY 40202, USA
- Correspondence: or (S.D.); (S.N.R.)
| |
Collapse
|