1
Liu X, Wang H, Gao J. scIALM: A method for sparse scRNA-seq expression matrix imputation using the Inexact Augmented Lagrange Multiplier with low error. Comput Struct Biotechnol J 2024; 23:549-558. [PMID: 38274995 PMCID: PMC10809077 DOI: 10.1016/j.csbj.2023.12.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 12/21/2023] [Accepted: 12/22/2023] [Indexed: 01/27/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a high-throughput sequencing technology that quantifies gene expression profiles of specific cell populations at the single-cell level, providing a foundation for studying cellular heterogeneity and patient pathological characteristics. It is effective for developmental, fertility, and disease studies. However, the cell-gene expression matrix of single-cell sequencing data is often sparse and contains numerous zero values. Some of these zeros derive from noise, and dropout noise in particular has a large impact on downstream analysis. In this paper, we propose a method named scIALM for imputation recovery of sparse single-cell RNA expression matrices, which employs the Inexact Augmented Lagrange Multiplier method to use sparse but clean (accurate) data to recover unknown entries in the matrix. We perform experimental analysis on four datasets, treating the expression matrix after quality control (QC) as the original matrix, and compare the performance of scIALM with six other methods using mean squared error (MSE), mean absolute error (MAE), Pearson correlation coefficient (PCC), and cosine similarity (CS). Our results demonstrate that scIALM accurately recovers the original data of the matrix with an error of 10e-4, and the mean values of the four metrics reach 4.5072 (MSE), 0.765 (MAE), 0.8701 (PCC), and 0.8896 (CS). In addition, at 10%-50% random masking noise, scIALM is the least sensitive to the masking ratio. For downstream analysis, this study uses the adjusted Rand index (ARI) and normalized mutual information (NMI) to evaluate the clustering effect, and the results improve on the three datasets that contain real cluster labels.
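The masking-based evaluation described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the scIALM implementation: the toy Poisson matrix, the choice to mask only nonzero entries, the decision to score only the masked entries, and the column-mean placeholder imputation are all assumptions made here, and the helper names `mask_entries` and `recovery_metrics` are hypothetical.

```python
import numpy as np

def mask_entries(matrix, ratio, rng):
    """Zero out a random `ratio` of the nonzero entries; return the masked copy and the mask."""
    nonzero = np.flatnonzero(matrix)
    chosen = rng.choice(nonzero, size=int(ratio * nonzero.size), replace=False)
    mask = np.zeros(matrix.size, dtype=bool)
    mask[chosen] = True
    mask = mask.reshape(matrix.shape)
    masked = matrix.copy()
    masked[mask] = 0
    return masked, mask

def recovery_metrics(original, imputed, mask):
    """MSE, MAE, Pearson correlation and cosine similarity on the masked entries only."""
    x, y = original[mask], imputed[mask]
    return {
        "MSE": float(np.mean((x - y) ** 2)),
        "MAE": float(np.mean(np.abs(x - y))),
        "PCC": float(np.corrcoef(x, y)[0, 1]),
        "CS": float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))),
    }

rng = np.random.default_rng(0)
original = rng.poisson(3.0, size=(50, 20)).astype(float)  # toy cells x genes matrix (assumption)
masked, mask = mask_entries(original, 0.10, rng)          # 10% masking ratio

# Placeholder imputation (NOT scIALM): fill each masked entry with the mean
# of the remaining nonzero values in the same gene (column).
imputed = masked.copy()
col_sums = masked.sum(axis=0)
col_counts = np.maximum((masked != 0).sum(axis=0), 1)
imputed[mask] = (col_sums / col_counts)[np.nonzero(mask)[1]]

metrics = recovery_metrics(original, imputed, mask)
```

Repeating the last three steps for ratios from 0.10 to 0.50 would give a sensitivity curve of the kind the abstract refers to; any real imputation method can be substituted for the placeholder.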
Affiliation(s)
- Xiaohong Liu
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Han Wang
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Jingyang Gao
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
2
Petrany A, Chen R, Zhang S, Chen Y. Theoretical framework for the difference of two negative binomial distributions and its application in comparative analysis of sequencing data. Genome Res 2024; 34:1636-1650. [PMID: 39406498 PMCID: PMC11529838 DOI: 10.1101/gr.278843.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Accepted: 09/10/2024] [Indexed: 11/01/2024]
Abstract
High-throughput sequencing (HTS) technologies have been instrumental in investigating biological questions at the bulk and single-cell levels. Comparative analysis of two HTS data sets often relies on testing the statistical significance for the difference of two negative binomial distributions (DOTNB). Although negative binomial distributions are well studied, the theoretical results for DOTNB remain largely unexplored. Here, we derive basic analytical results for DOTNB and examine its asymptotic properties. As a state-of-the-art application of DOTNB, we introduce DEGage, a computational method for detecting differentially expressed genes (DEGs) in scRNA-seq data. DEGage calculates the mean of the sample-wise differences of gene expression levels as the test statistic and determines significant differential expression by computing the P-value with DOTNB. Extensive validation using simulated and real scRNA-seq data sets demonstrates that DEGage outperforms five popular DEG analysis tools: DEGseq2, DEsingle, edgeR, Monocle3, and scDD. DEGage is robust against high dropout levels and exhibits superior sensitivity when applied to balanced and imbalanced data sets, even with small sample sizes. We utilize DEGage to analyze prostate cancer scRNA-seq data sets and identify marker genes for 17 cell types. Furthermore, we apply DEGage to scRNA-seq data sets of mouse neurons with and without fear memory and reveal eight potential memory-related genes overlooked in previous analyses. The theoretical results and supporting software for DOTNB can be widely applied to comparative analyses of dispersed count data in HTS and broad research questions.
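As a rough numerical illustration of the difference of two negative binomial variables, the sketch below checks by simulation that, for independent X and Y, the mean of X - Y is the difference of the means and the variance is the sum of the variances. It is not DEGage and does not reproduce the paper's analytical results; the parameter values are hypothetical, independence is assumed, and the parameterization is NumPy's (failures before `n` successes with success probability `p`).

```python
import numpy as np

def nb_mean_var(n, p):
    """Mean and variance of NB(n, p) counting failures before n successes (NumPy's convention)."""
    mean = n * (1 - p) / p
    return mean, mean / p

rng = np.random.default_rng(1)
n1, p1 = 5.0, 0.3   # hypothetical parameters for condition 1
n2, p2 = 4.0, 0.4   # hypothetical parameters for condition 2

x = rng.negative_binomial(n1, p1, size=200_000)
y = rng.negative_binomial(n2, p2, size=200_000)
d = x - y  # the difference takes negative as well as positive integer values

m1, v1 = nb_mean_var(n1, p1)
m2, v2 = nb_mean_var(n2, p2)
expected_mean = m1 - m2   # E[X - Y] = E[X] - E[Y]
expected_var = v1 + v2    # Var[X - Y] = Var[X] + Var[Y] under independence
empirical_mean, empirical_var = d.mean(), d.var()
```

Unlike a single negative binomial, the difference is supported on all integers, which is one reason it needs its own distributional treatment when used as a test statistic.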
Affiliation(s)
- Alicia Petrany
  - Department of Biological and Biomedical Sciences, Rowan University, Glassboro, New Jersey 08028, USA
- Ruoyu Chen
  - Moorestown High School, Moorestown, New Jersey 08057, USA
- Shaoqiang Zhang
  - College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
- Yong Chen
  - Department of Biological and Biomedical Sciences, Rowan University, Glassboro, New Jersey 08028, USA
3
Dollinger E, Silkwood K, Atwood S, Nie Q, Lander AD. Statistically principled feature selection for single cell transcriptomics. bioRxiv 2024:2024.10.11.617709. [PMID: 39463971 PMCID: PMC11507810 DOI: 10.1101/2024.10.11.617709] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
The high dimensionality of data in single cell transcriptomics (scRNAseq) requires investigators to choose subsets of genes (feature selection) for downstream analysis (e.g., unsupervised cell clustering). The evaluation of different approaches to feature selection is hampered by the fact that, as we show here, the performance of feature selection methods varies greatly with the task being performed. For routine cell type identification, even randomly chosen features can perform well, but for cell type differences that are subtle, both number of features and selection strategy can matter strongly. Here we present a simple feature selection method grounded in an analytical model that, without resorting to arbitrary thresholds or user-defined parameters, allows for interpretable delineation of both how many and which features to choose, facilitating identification of biologically meaningful rare cell types. We compare this method to default methods in scanpy and Seurat, as well as SCTransform, showing how greater accuracy can often be achieved with surprisingly few, well-chosen features.
Affiliation(s)
- Emmanuel Dollinger
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697
- Kai Silkwood
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697
- Scott Atwood
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697
- Qing Nie
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697
  - Department of Mathematics, University of California, Irvine, Irvine, CA 92697
- Arthur D. Lander
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697
4
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. Leveraging gene correlations in single cell transcriptomic data. BMC Bioinformatics 2024; 25:305. [PMID: 39294560 PMCID: PMC11411778 DOI: 10.1186/s12859-024-05926-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 09/09/2024] [Indexed: 09/20/2024] Open
Abstract
BACKGROUND: Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data (looking for rare cell types, subtleties of cell states, and details of gene regulatory networks), there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually). RESULTS: We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization (a step that skews distributions, particularly for sparse data) and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. CONCLUSIONS: New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.
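The idea that raw counts can be compared against a Poisson sampling null can be illustrated with a small simulation. This is only a sketch under strong simplifying assumptions (equal sequencing depth in every cell, no transcriptional noise, two made-up cell states) and it uses the plain variance-to-mean ratio rather than the statistics BigSur actually computes; the rates and the helper name `fano` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells = 5000

# Gene A: the same expected count in every cell, so only Poisson sampling noise.
gene_a = rng.poisson(2.0, size=n_cells)

# Gene B: two hypothetical cell states with different expected counts
# (cell heterogeneity) and the same overall mean as gene A.
rates = np.where(rng.random(n_cells) < 0.5, 0.5, 3.5)
gene_b = rng.poisson(rates)

def fano(counts):
    """Variance-to-mean ratio; close to 1 for pure Poisson counts."""
    return counts.var(ddof=1) / counts.mean()

fano_a, fano_b = fano(gene_a), fano(gene_b)
```

With these settings gene A has a ratio near 1, while gene B's ratio is roughly 2 because the variance of the underlying rates adds to the Poisson variance. A departure of that kind from the sampling null is what flags a gene as carrying biological variation.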
Affiliation(s)
- Kai Silkwood
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Emmanuel Dollinger
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
  - Department of Mathematics, University of California, Irvine, Irvine, CA, USA
- Joshua Gervin
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Scott Atwood
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Qing Nie
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
  - Department of Mathematics, University of California, Irvine, Irvine, CA, USA
- Arthur D Lander
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
5
Jin W, Pei J, Roy JR, Jayaraman S, Ahalliya RM, Kanniappan GV, Mironescu M, Palanisamy CP. Comprehensive review on single-cell RNA sequencing: A new frontier in Alzheimer's disease research. Ageing Res Rev 2024; 100:102454. [PMID: 39142391 DOI: 10.1016/j.arr.2024.102454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 08/07/2024] [Accepted: 08/09/2024] [Indexed: 08/16/2024]
Abstract
Alzheimer's disease (AD) is a multifaceted neurodegenerative condition marked by gradual cognitive deterioration and the loss of neurons. While conventional bulk RNA sequencing techniques have shed light on AD pathology, they frequently obscure the cellular diversity within brain tissues. The advent of single-cell RNA sequencing (scRNA-seq) has transformed our capability to analyze the cellular composition of AD, allowing for the detection of unique cell populations, rare cell types, and gene expression alterations at an individual cell level. This review examines the use of scRNA-seq in AD research, focusing on its contributions to understanding cellular diversity, disease progression, and potential therapeutic targets. We discuss key technological innovations, data analysis techniques, and challenges associated with scRNA-seq in studying AD. Furthermore, we highlight recent studies that have utilized scRNA-seq to identify novel biomarkers, uncover disease-associated pathways, and elucidate the role of non-neuronal cells, such as microglia and astrocytes, in AD pathogenesis. By providing a comprehensive overview of advancements in scRNA-seq for unraveling cellular heterogeneity in AD, this review highlights the transformative impact of scRNA-seq on our comprehension of disease mechanisms and the creation of targeted treatments.
Affiliation(s)
- Wengang Jin
  - Qinba State Key Laboratory of Biological Resources and Ecological Environment, 2011 QinLing-Bashan Mountains Bioresources Comprehensive Development C. I. C, Shaanxi Province Key Laboratory of Bio-Resources, College of Bioscience and Bioengineering, Shaanxi University of Technology, Hanzhong 723001, China
- JinJin Pei
  - Qinba State Key Laboratory of Biological Resources and Ecological Environment, 2011 QinLing-Bashan Mountains Bioresources Comprehensive Development C. I. C, Shaanxi Province Key Laboratory of Bio-Resources, College of Bioscience and Bioengineering, Shaanxi University of Technology, Hanzhong 723001, China
- Jeane Rebecca Roy
  - Department of Anatomy, Bhaarath Medical College and hospital, Bharath Institute of Higher Education and Research (BIHER), Chennai, Tamil Nadu 600073, India
- Selvaraj Jayaraman
  - Centre of Molecular Medicine and Diagnostics (COMManD), Department of Biochemistry, Saveetha Dental College & Hospital, Saveetha Institute of Medical & Technical Sciences, Saveetha University, Chennai 600077, India
- Rathi Muthaiyan Ahalliya
  - Department of Biochemistry and Cancer Research Centre, FASCM, Karpagam Academy of Higher Education, Coimbatore, Tamil Nadu 641021, India
- Gopalakrishnan Velliyur Kanniappan
  - Center for Global Health Research, Saveetha Medical College & Hospital, Saveetha Institute of Medical and Technical Sciences (SIMATS), Thandalam, Chennai, Tamil Nadu 602105, India
- Monica Mironescu
  - Faculty of Agricultural Sciences Food Industry and Environmental Protection, Lucian Blaga University of Sibiu, Bv. Victoriei 10, Sibiu 550024, Romania
- Chella Perumal Palanisamy
  - Department of Chemical Technology, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
6
Biswas B, Kumar N, Sugimoto M, Hoque MA. scHD4E: Novel ensemble learning-based differential expression analysis method for single-cell RNA-sequencing data. Comput Biol Med 2024; 178:108769. [PMID: 38897145 DOI: 10.1016/j.compbiomed.2024.108769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Revised: 05/14/2024] [Accepted: 06/15/2024] [Indexed: 06/21/2024]
Abstract
Differential expression (DE) analysis between cell types that captures the complicated features of scRNA-seq data is crucial. Recently, a range of methods has been developed for scRNA-seq data analysis, based on different modeling frameworks, assumptions, strategies, and test statistics that account for various data features. scDEA is a recently developed ensemble learning-based DE analysis method that combines the p-values generated by 12 individual DE analysis methods using Lancaster's combination, producing more accurate and stable results than the individual methods. The objective of our study is to propose a new ensemble learning-based DE analysis method, scHD4E, that uses only the 4 top-performing individual methods. These 4 methods were selected through an evaluation process using six real scRNA-seq data sets. We conducted comprehensive experiments on five experimental data sets to evaluate our proposed method in terms of sample size effects, batch effects, type I error control, gene ontology enrichment analysis, runtime, identified matched DE genes, and semantic similarity measurement between methods. We also performed similar analyses (except the last 3 items) and computed performance measures such as accuracy, F1 score, and Matthews correlation coefficient for a simulated data set. The results show that scHD4E performs better than all the individual methods and scDEA in all of the above respects. We expect that scHD4E will serve data scientists in detecting DEGs in scRNA-seq data analysis. To implement our proposed method, an R package, scHD4E, and its Shiny application have been developed and are available on GitHub at https://github.com/bbiswas1989/scHD4E and https://github.com/bbiswas1989/scHD4E-Shiny.
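To make the idea of combining p-values from several DE methods concrete, here is a minimal sketch of Fisher's method, which is the equal-weight special case of Lancaster's combination (each p-value contributes 2 degrees of freedom). It is not the scHD4E or scDEA code: the abstract does not state how scHD4E combines its four methods, the example p-values are made up, the function name `fisher_combine` is hypothetical, and the method assumes independent p-values, which results from different tools run on the same data generally are not.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic T = -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null, whose survival function has the
    closed form exp(-T/2) * sum_{i<k} (T/2)^i / i!.
    """
    k = len(p_values)
    half_t = -sum(math.log(p) for p in p_values)  # T / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_t / i   # (T/2)^i / i!
        total += term
    return math.exp(-half_t) * total

# Hypothetical p-values for one gene from four DE methods.
combined = fisher_combine([0.04, 0.10, 0.03, 0.20])
```

A single p-value passes through unchanged, and several moderately small p-values reinforce each other, which is the behavior an ensemble of DE methods relies on.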
Affiliation(s)
- Biplab Biswas
  - Department of Statistics, Faculty of Science, Bangabandhu Sheikh Mujibur Rahman Science & Technology University, Gopalganj, 8100, Bangladesh
  - Department of Statistics, Faculty of Science, University of Rajshahi, Rajshahi, 6205, Bangladesh
- Nishith Kumar
  - Department of Statistics, Faculty of Science, Bangabandhu Sheikh Mujibur Rahman Science & Technology University, Gopalganj, 8100, Bangladesh
- Masahiro Sugimoto
  - Institute for Advanced Biosciences, Keio University 246-2 Mizukami, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
- Md Aminul Hoque
  - Department of Statistics, Faculty of Science, University of Rajshahi, Rajshahi, 6205, Bangladesh
7
Wu CH, Zhou X, Chen M. The curses of performing differential expression analysis using single-cell data. bioRxiv 2024:2024.05.28.596315. [PMID: 38853843 PMCID: PMC11160624 DOI: 10.1101/2024.05.28.596315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Differential expression analysis is pivotal in single-cell transcriptomics for unraveling cell-type-specific responses to stimuli. While numerous methods are available to identify differentially expressed genes in single-cell data, recent evaluations of both single-cell-specific methods and methods adapted from bulk studies have revealed significant shortcomings in performance. In this paper, we dissect the four major challenges in single-cell DE analysis: normalization, excessive zeros, donor effects, and cumulative biases. These "curses" underscore the limitations and conceptual pitfalls in existing workflows. In response, we introduce a novel paradigm addressing several of these issues.
8
Ozier-Lafontaine A, Fourneaux C, Durif G, Arsenteva P, Vallot C, Gandrillon O, Gonin-Giraud S, Michel B, Picard F. Kernel-based testing for single-cell differential analysis. Genome Biol 2024; 25:114. [PMID: 38702740 PMCID: PMC11069218 DOI: 10.1186/s13059-024-03255-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Accepted: 04/22/2024] [Indexed: 05/06/2024] Open
Abstract
Single-cell technologies offer insights into molecular feature distributions, but comparing them poses challenges. We propose a kernel-testing framework for non-linear cell-wise distribution comparison, analyzing gene expression and epigenomic modifications. Our method allows feature-wise and global transcriptome/epigenome comparisons, revealing cell population heterogeneities. Using a classifier based on embedding variability, we identify transitions in cell states, overcoming limitations of traditional single-cell analysis. Applied to single-cell ChIP-Seq data, our approach identifies untreated breast cancer cells with an epigenomic profile resembling persister cells. This demonstrates the effectiveness of kernel testing in uncovering subtle population variations that might be missed by other methods.
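A kernel two-sample comparison of the general kind described here can be sketched with the maximum mean discrepancy (MMD) and a permutation test. The abstract does not specify the authors' test statistic, so MMD is swapped in purely for illustration; the Gaussian kernel, its bandwidth `gamma`, the simulated data, and all function names below are assumptions rather than the paper's method or software.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix between the rows of a and b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0))

def mmd2(k, n):
    """Biased squared MMD from a pooled kernel matrix whose first n rows are sample 1."""
    return k[:n, :n].mean() + k[n:, n:].mean() - 2 * k[:n, n:].mean()

def mmd_permutation_test(x, y, gamma=1.0, n_perm=500, seed=0):
    """Return the observed MMD^2 and a permutation p-value for H0: same distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    k = rbf_kernel(pooled, pooled, gamma)
    n = len(x)
    observed = mmd2(k, n)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))      # shuffle the group labels
        if mmd2(k[np.ix_(idx, idx)], n) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(60, 5))  # toy "cells x features" for condition 1
y = rng.normal(0.0, 2.0, size=(60, 5))  # same mean, different spread
stat, p_value = mmd_permutation_test(x, y, gamma=0.1)
```

The two simulated groups share the same mean and differ only in spread, so a test on means alone would see little, whereas a kernel statistic compares the whole distributions. That is the kind of non-linear, distribution-level difference the abstract refers to.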
Affiliation(s)
- A Ozier-Lafontaine
  - Nantes Université, Centrale Nantes, Laboratoire de Mathématiques Jean Leray, CNRS UMR 6629, F-44000, Nantes, France
- C Fourneaux
  - Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
- G Durif
  - Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
- P Arsenteva
  - Nantes Université, Centrale Nantes, Laboratoire de Mathématiques Jean Leray, CNRS UMR 6629, F-44000, Nantes, France
- C Vallot
  - CNRS UMR3244, Institut Curie, PSL University, Paris, France
  - Translational Research Department, Institut Curie, PSL University, Paris, France
- O Gandrillon
  - Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
- S Gonin-Giraud
  - Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
- B Michel
  - Nantes Université, Centrale Nantes, Laboratoire de Mathématiques Jean Leray, CNRS UMR 6629, F-44000, Nantes, France
- F Picard
  - Laboratory of Biology and Modelling of the Cell, Université de Lyon, Ecole Normale Supérieure de Lyon, CNRS, UMR5239, Université Claude Bernard Lyon 1, Lyon, France
9
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. Leveraging gene correlations in single cell transcriptomic data. bioRxiv 2023:2023.03.14.532643. [PMID: 36993765 PMCID: PMC10055147 DOI: 10.1101/2023.03.14.532643] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/20/2023]
Abstract
BACKGROUND: Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data (looking for rare cell types, subtleties of cell states, and details of gene regulatory networks), there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data when ground truth about biological variation is unknown (i.e., usually). RESULTS: We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization (a step that skews distributions, particularly for sparse data) and calculate p-values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. CONCLUSIONS: New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.
Affiliation(s)
- Kai Silkwood
  - Center for Complex Biological Systems, University of California, Irvine, Irvine CA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
- Emmanuel Dollinger
  - Center for Complex Biological Systems, University of California, Irvine, Irvine CA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
  - Department of Mathematics, University of California, Irvine, Irvine CA
- Josh Gervin
  - Center for Complex Biological Systems, University of California, Irvine, Irvine CA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
- Scott Atwood
  - Center for Complex Biological Systems, University of California, Irvine, Irvine CA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
- Qing Nie
  - Center for Complex Biological Systems, University of California, Irvine, Irvine CA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
  - Department of Mathematics, University of California, Irvine, Irvine CA
- Arthur D. Lander
  - Center for Complex Biological Systems, University of California, Irvine, Irvine CA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA
10
Shakola F, Palejev D, Ivanov I. A Framework for Comparison and Assessment of Synthetic RNA-Seq Data. Genes (Basel) 2022; 13:2362. [PMID: 36553629 PMCID: PMC9778097 DOI: 10.3390/genes13122362] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/09/2022] [Revised: 12/05/2022] [Accepted: 12/06/2022] [Indexed: 12/16/2022] Open
Abstract
The ever-growing number of methods for the generation of synthetic bulk and single cell RNA-seq data have multiple and diverse applications. They are often aimed at benchmarking bioinformatics algorithms for purposes such as sample classification, differential expression analysis, correlation and network studies and the optimization of data integration and normalization techniques. Here, we propose a general framework to compare synthetically generated RNA-seq data and select a data-generating tool that is suitable for a set of specific study goals. As there are multiple methods for synthetic RNA-seq data generation, researchers can use the proposed framework to make an informed choice of an RNA-seq data simulation algorithm and software that are best suited for their specific scientific questions of interest.
Affiliation(s)
- Felitsiya Shakola
  - GATE Institute, Sofia University, 125 Tsarigradsko Shosse, Bl. 2, 1113 Sofia, Bulgaria
- Dean Palejev
  - Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Acad. G. Bonchev St., Bl. 8, 1113 Sofia, Bulgaria
- Ivan Ivanov
  - Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX 77843, USA
11
Sen Puliparambil B, Tomal JH, Yan Y. A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data. Biology (Basel) 2022; 11:1495. [PMID: 36290397 PMCID: PMC9598401 DOI: 10.3390/biology11101495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 09/21/2022] [Accepted: 09/30/2022] [Indexed: 11/05/2022]
Abstract
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning offers the potential for gene (feature) selection from high-dimensional scRNA-seq data. Even though multiple machine learning methods appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performance across data sets, each of which poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) on the metrics of area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.
Affiliation(s)
- Bhavithry Sen Puliparambil
  - Master of Science in Data Science Program, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
- Jabed H. Tomal
  - Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
- Yan Yan
  - Department of Computing Science, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada