1
Hrovatin K, Sikkema L, Shitov VA, Heimberg G, Shulman M, Oliver AJ, Mueller MF, Ibarra IL, Wang H, Ramírez-Suástegui C, He P, Schaar AC, Teichmann SA, Theis FJ, Luecken MD. Considerations for building and using integrated single-cell atlases. Nat Methods 2024:10.1038/s41592-024-02532-y. [PMID: 39672979 DOI: 10.1038/s41592-024-02532-y] [Received: 10/31/2023] [Accepted: 10/22/2024] [Indexed: 12/15/2024]
Abstract
The rapid adoption of single-cell technologies has created an opportunity to build single-cell 'atlases' integrating diverse datasets across many laboratories. Such atlases can serve as a reference for analyzing and interpreting current and future data. However, it has become apparent that atlasing approaches differ, and the impact of these differences is often unclear. Here we review the current atlasing literature and present considerations for building and using atlases. Importantly, we find that no one-size-fits-all protocol for atlas building exists, but rather we discuss context-specific considerations and workflows, including atlas conceptualization, data collection, curation and integration, atlas evaluation and atlas sharing. We further highlight the benefits of integrated atlases for analyses of new datasets and deriving biological insights beyond what is possible from individual datasets. Our overview of current practices and associated recommendations will improve the quality of atlases to come, facilitating the shift to a unified, reference-based understanding of single-cell biology.
Affiliation(s)
- Karin Hrovatin
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Lisa Sikkema
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Vladimir A Shitov
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Comprehensive Pneumology Center (CPC) with the CPC-M bioArchive / Institute of Lung Health and Immunity (LHI), Helmholtz Zentrum München; Member of the German Center for Lung Research (DZL), Munich, Germany
- Graham Heimberg
  - Department of OMNI Bioinformatics, Genentech, South San Francisco, CA, USA
  - Department of Biological Research | AI Development, Genentech, South San Francisco, CA, USA
- Maiia Shulman
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Amanda J Oliver
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Michaela F Mueller
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Ignacio L Ibarra
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Hanchen Wang
  - Department of Biological Research | AI Development, Genentech, South San Francisco, CA, USA
  - Department of Computer Science, Stanford University, Palo Alto, CA, USA
- Ciro Ramírez-Suástegui
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Peng He
  - Department of Pathology, University of California, San Francisco, San Francisco, CA, USA
- Anna C Schaar
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Sarah A Teichmann
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
  - Theory of Condensed Matter Group, Department of Physics, Cavendish Laboratory, University of Cambridge, Cambridge, UK
  - Cambridge Stem Cell Institute and Department of Medicine, University of Cambridge, Cambridge, UK
  - CIFAR MacMillan Multiscale Human Programme, Toronto, Ontario, Canada
- Fabian J Theis
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
  - Department of Mathematics, Technical University of Munich, Garching, Germany
- Malte D Luecken
  - Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Comprehensive Pneumology Center (CPC) with the CPC-M bioArchive / Institute of Lung Health and Immunity (LHI), Helmholtz Zentrum München; Member of the German Center for Lung Research (DZL), Munich, Germany
2
Liu T, Long W, Cao Z, Wang Y, He CH, Zhang L, Strittmatter SM, Zhao H. CosGeneGate selects multi-functional and credible biomarkers for single-cell analysis. Brief Bioinform 2024; 26:bbae626. [PMID: 39592241 PMCID: PMC11596696 DOI: 10.1093/bib/bbae626] [Received: 07/12/2024] [Revised: 10/07/2024] [Accepted: 11/14/2024] [Indexed: 11/28/2024] Open
Abstract
MOTIVATION: Selecting representative genes or marker genes to distinguish cell types is an important task in single-cell sequencing analysis. Although many methods have been proposed to select marker genes, the selected genes may be redundant and/or may not show the cell-type-specific expression patterns needed to distinguish cell types. RESULTS: Here, we present a novel model, named CosGeneGate, for more effective marker gene selection. CosGeneGate combines the advantages of selecting marker genes based on cell-type classification accuracy and on marker-gene-specific expression patterns. We demonstrate that the marker genes selected by CosGeneGate perform better than those from existing methods in various downstream analyses, on both public datasets and newly sequenced datasets. The non-redundant marker genes identified by CosGeneGate for major cell types and tissues in human can be found at: https://github.com/VivLon/CosGeneGate/blob/main/marker gene list.xlsx.
Affiliation(s)
- Tianyu Liu
  - Department of Biostatistics, Yale University, New Haven, CT, 06520, United States
  - Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06520, United States
- Wenxin Long
  - Department of Biostatistics, Yale University, New Haven, CT, 06520, United States
  - Department of Statistics, The Pennsylvania State University, University Park, PA, 16820, United States
- Zhiyuan Cao
  - Department of Biostatistics, Yale University, New Haven, CT, 06520, United States
  - Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06520, United States
  - Program of Health Informatics, Yale University, New Haven, CT, 06520, United States
- Yuge Wang
  - Department of Biostatistics, Yale University, New Haven, CT, 06520, United States
- Chuan Hua He
  - Department of Neurology, Yale University School of Medicine, New Haven, CT, 06520, United States
- Le Zhang
  - Department of Neurology, Yale University School of Medicine, New Haven, CT, 06520, United States
  - Department of Neuroscience, Yale University School of Medicine, New Haven, CT, 06520, United States
- Stephen M Strittmatter
  - Department of Neurology, Yale University School of Medicine, New Haven, CT, 06520, United States
  - Department of Neuroscience, Yale University School of Medicine, New Haven, CT, 06520, United States
  - Cellular Neuroscience, Neurodegeneration and Repair Program, Yale University School of Medicine, New Haven, CT, 06520, United States
- Hongyu Zhao
  - Department of Biostatistics, Yale University, New Haven, CT, 06520, United States
  - Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06520, United States
3
Hu H, Quon G. scPair: Boosting single cell multimodal analysis by leveraging implicit feature selection and single cell atlases. Nat Commun 2024; 15:9932. [PMID: 39548084 PMCID: PMC11568318 DOI: 10.1038/s41467-024-53971-2] [Received: 03/05/2024] [Accepted: 10/25/2024] [Indexed: 11/17/2024] Open
Abstract
Multimodal single-cell assays profile multiple sets of features in the same cells and are widely used for identifying and mapping cell states between chromatin and mRNA and linking regulatory elements to target genes. However, the high dimensionality of input features and shallow sequencing depth compared to unimodal assays pose challenges in data analysis. Here we present scPair, a multimodal single-cell data framework that overcomes these challenges by employing an implicit feature selection approach. scPair uses dual encoder-decoder structures trained on paired data to align cell states across modalities and predict features from one modality to another. We demonstrate that scPair outperforms existing methods in accuracy and execution time, and facilitates downstream tasks such as trajectory inference. We further show scPair can augment smaller multimodal datasets with larger unimodal atlases to increase statistical power to identify groups of transcription factors active during different stages of neural differentiation.
Affiliation(s)
- Hongru Hu
  - Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
  - Genome Center, University of California, Davis, CA, USA
- Gerald Quon
  - Genome Center, University of California, Davis, CA, USA
  - Department of Molecular and Cellular Biology, University of California, Davis, CA, USA
4
Chen T, Huang C, Chen J, Xue J, Yang Z, Wang Y, Wu S, Wei W, Chen L, Liao S, Qin X, He R, Qin B, Liu C. Inorganic pyrophosphatase 1: a key player in immune and metabolic reprogramming in ankylosing spondylitis. Genes Immun 2024:10.1038/s41435-024-00308-0. [PMID: 39511317 DOI: 10.1038/s41435-024-00308-0] [Received: 06/23/2024] [Revised: 10/22/2024] [Accepted: 10/31/2024] [Indexed: 11/15/2024]
Abstract
The relationships among immune cells, metabolites, and ankylosing spondylitis (AS) events were analyzed via Mendelian randomization (MR), and potential immune cells and metabolites were identified as risk factors for AS. Their relationships were subjected to intermediary MR analysis to identify the final immune cells and metabolites. The vertebral bone marrow blood samples from three patients with and without AS were subjected to 10× single-cell sequencing to further elucidate the role of immune cells in AS. The key genes were screened via expression quantitative trait loci (eQTLs) and MR analyses. The metabolic differences between the two groups were compared through single-cell metabolism analysis. Two subgroups of CD8+ memory T cells and naive B cells were obtained from the combined results of intermediary MR analysis and AS single-cell analysis. After the verification of key genes, inorganic pyrophosphatase 1 (PPA1) was identified as the hub gene, as it is differentially expressed in CD8+ memory T cells and can affect the metabolism of T cells in AS by affecting the expression of ferulic acid (FA) 4-sulfate, which participates in cellular immunity in AS.
Affiliation(s)
- Tianyou Chen
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Chengqian Huang
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Jiarui Chen
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Jiang Xue
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Zhenwei Yang
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Yihan Wang
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Songze Wu
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Wendi Wei
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Liyi Chen
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Shian Liao
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Xiaopeng Qin
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Rongqing He
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Boli Qin
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
- Chong Liu
  - The First Affiliated Hospital of Guangxi Medical University, No.6 Shuangyong Road, Nanning, Guangxi, 530021, People's Republic of China
5
Chang LY, Hao TY, Wang WJ, Lin CY. Inference of single-cell network using mutual information for scRNA-seq data analysis. BMC Bioinformatics 2024; 25:292. [PMID: 39237886 PMCID: PMC11378379 DOI: 10.1186/s12859-024-05895-3] [Received: 10/29/2022] [Accepted: 08/08/2024] [Indexed: 09/07/2024] Open
Abstract
BACKGROUND: With advances in single-cell RNA sequencing (scRNA-seq) technology, deriving inherent biological system information from expression profiles at single-cell resolution has become possible. Network modeling that estimates the associations between genes is known to better reveal dynamic changes in biological systems. However, accurately constructing a single-cell network (SCN) to capture the network architecture of each cell and further explore cell-to-cell heterogeneity remains challenging. RESULTS: We introduce SINUM, a method for constructing the SIngle-cell Network Using Mutual information, which estimates mutual information between any two genes from scRNA-seq data to determine whether they are dependent or independent in a specific cell. Experiments on various scRNA-seq datasets with different cell numbers, based on eight performance indexes (e.g., adjusted Rand index and F-measure index), validated the accuracy and robustness of SINUM in cell type identification, showing it to be superior to the state-of-the-art SCN inference method. Additionally, the SINUM SCNs exhibit high overlap with the human interactome and possess the scale-free property. CONCLUSIONS: SINUM presents a view of biological systems at the network level to detect cell-type marker genes/gene pairs and investigate time-dependent changes in gene associations during embryo development. Codes for SINUM are freely available at https://github.com/SysMednet/SINUM.
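As a rough illustration of the quantity the abstract refers to, the sketch below computes a plug-in (histogram-based) mutual information estimate between two expression vectors. It is not the SINUM estimator, which works per cell and is not described in the abstract; the function name `mutual_information`, the bin count, and the simulated gene vectors are all assumptions made for this example.

```python
import numpy as np

def mutual_information(x, y, bins=4):
    """Plug-in mutual information (in nats) from a 2D histogram of x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # joint probability table
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y, shape (1, bins)
    nz = pxy > 0                               # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Simulated (hypothetical) expression values for three genes across 2000 cells.
rng = np.random.default_rng(0)
gene_a = rng.poisson(5, size=2000).astype(float)
gene_b = gene_a + rng.normal(0, 0.5, size=2000)    # depends on gene_a
gene_c = rng.poisson(5, size=2000).astype(float)   # independent of gene_a

mi_dependent = mutual_information(gene_a, gene_b)
mi_independent = mutual_information(gene_a, gene_c)
```

A dependent pair should give a clearly larger estimate than an independent pair; a method in this family would then threshold or test such a value to decide whether to draw an edge between the two genes.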
Affiliation(s)
- Lan-Yun Chang
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
- Ting-Yi Hao
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
- Wei-Jie Wang
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
- Chun-Yu Lin
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
  - Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
  - Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
  - Center for Intelligent Drug Systems and Smart Bio-Devices, National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
  - Cancer and Immunology Research Center, National Yang Ming Chiao Tung University, Taipei, 112, Taiwan
  - School of Dentistry, Kaohsiung Medical University, Kaohsiung, 807, Taiwan
6
Xu Y, Wang S, Feng Q, Xia J, Li Y, Li HD, Wang J. scCAD: Cluster decomposition-based anomaly detection for rare cell identification in single-cell expression data. Nat Commun 2024; 15:7561. [PMID: 39215003 PMCID: PMC11364754 DOI: 10.1038/s41467-024-51891-9] [Received: 02/05/2024] [Accepted: 08/15/2024] [Indexed: 09/04/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies have become essential tools for characterizing cellular landscapes within complex tissues. Large-scale single-cell transcriptomics holds great potential for identifying rare cell types critical to the pathogenesis of diseases and biological processes. Existing methods for identifying rare cell types often rely on one-time clustering using partial or global gene expression. However, these rare cell types may be overlooked during the clustering phase, posing challenges for their accurate identification. In this paper, we propose a Cluster decomposition-based Anomaly Detection method (scCAD), which iteratively decomposes clusters based on the most differential signals in each cluster to effectively separate rare cell types and achieve accurate identification. We benchmark scCAD on 25 real-world scRNA-seq datasets, demonstrating its superior performance compared to 10 state-of-the-art methods. In-depth case studies across diverse datasets, including mouse airway, brain, intestine, human pancreas, immunology data, and clear cell renal cell carcinoma, showcase scCAD's efficiency in identifying rare cell types in complex biological scenarios. Furthermore, scCAD can correct the annotation of rare cell types and identify immune cell subtypes associated with disease, thereby offering valuable insights into disease progression.
Affiliation(s)
- Yunpei Xu
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Xiangjiang Laboratory, Changsha, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, China
- Shaokai Wang
  - David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Qilong Feng
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Xiangjiang Laboratory, Changsha, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, China
- Jiazhi Xia
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, China
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA, USA
- Hong-Dong Li
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Xiangjiang Laboratory, Changsha, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Xiangjiang Laboratory, Changsha, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, China
7
Du Q, Wang D, Zhang Y. The role of artificial intelligence in disease prediction: using ensemble model to predict disease mellitus. Front Med (Lausanne) 2024; 11:1425305. [PMID: 39170045 PMCID: PMC11335546 DOI: 10.3389/fmed.2024.1425305] [Received: 04/29/2024] [Accepted: 07/11/2024] [Indexed: 08/23/2024] Open
Abstract
The traditional complications of diabetes are well known and continue to pose a considerable burden to millions of people with diabetes mellitus (DM). With the continuous accumulation of medical data and technological advances, artificial intelligence has shown great potential and advantages in the prediction, diagnosis, and treatment of DM. When DM is diagnosed, subjective factors and the diagnostic methods used by doctors can affect the diagnostic results. Using artificial intelligence for fast and effective early prediction of DM can therefore provide decision-making support to doctors and give patients more accurate and timely treatment, which is of great clinical and practical significance. In this paper, an adaptive Stacking ensemble model is proposed based on the theory of "error-ambiguity decomposition," which can adaptively select the base classifiers from the pre-selected models. The adaptive Stacking ensemble model proposed in this paper is compared with KNN, SVM, RF, LR, DT, GBDT, XGBoost, LightGBM, CatBoost, MLP and traditional Stacking ensemble models. The results showed that the adaptive Stacking ensemble model achieved the best performance in five evaluation metrics: accuracy, precision, recall, F1 value and AUC value, which were 0.7559, 0.7286, 0.8132, 0.7686 and 0.8436, respectively. The model can effectively predict DM patients and provide a reference value for the screening and diagnosis of clinical DM.
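The reported metrics can be cross-checked against each other, since the F1 value is the harmonic mean of precision and recall. The short sketch below recomputes F1 from the precision and recall quoted in the abstract, assuming the standard definition F1 = 2PR / (P + R); the variable names are illustrative only.

```python
# Precision and recall as quoted in the abstract.
precision = 0.7286
recall = 0.8132

# Standard F1 definition: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Rounded to four decimals for comparison with the reported F1 value of 0.7686.
f1_rounded = round(f1, 4)
```

Under that assumption the recomputed value agrees with the reported F1 to four decimal places, so the three figures are internally consistent.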
Affiliation(s)
- Yimin Zhang
  - Key Laboratory of Traditional Chinese Medicine Classical Theory, Ministry of Education, Shandong University of Traditional Chinese Medicine, Jinan, China
8
Peeters F, Cappuyns S, Piqué-Gili M, Phillips G, Verslype C, Lambrechts D, Dekervel J. Applications of single-cell multi-omics in liver cancer. JHEP Rep 2024; 6:101094. [PMID: 39022385 PMCID: PMC11252522 DOI: 10.1016/j.jhepr.2024.101094] [Received: 01/31/2024] [Revised: 03/18/2024] [Accepted: 03/27/2024] [Indexed: 07/20/2024] Open
Abstract
Primary liver cancer, more specifically hepatocellular carcinoma (HCC), remains a significant global health problem associated with increasing incidence and mortality. Clinical, biological, and molecular heterogeneity are well-known hallmarks of cancer and HCC is considered one of the most heterogeneous tumour types, displaying substantial inter-patient, intertumoural and intratumoural variability. This heterogeneity plays a pivotal role in hepatocarcinogenesis, metastasis, relapse and drug response or resistance. Unimodal single-cell sequencing techniques have already revolutionised our understanding of the different layers of molecular hierarchy in the tumour microenvironment of HCC. By highlighting the cellular heterogeneity and the intricate interactions among cancer, immune and stromal cells before and during treatment, these techniques have contributed to a deeper comprehension of tumour clonality, hematogenous spreading and the mechanisms of action of immune checkpoint inhibitors. However, major questions remain to be elucidated, with the identification of biomarkers predicting response or resistance to immunotherapy-based regimens representing an important unmet clinical need. Although the application of single-cell multi-omics in liver cancer research has been limited thus far, a revolution of individualised care for patients with HCC will only be possible by integrating various unimodal methods into multi-omics methodologies at the single-cell resolution. In this review, we will highlight the different established single-cell sequencing techniques and explore their biological and clinical impact on liver cancer research, while casting a glance at the future role of multi-omics in this dynamic and rapidly evolving field.
Affiliation(s)
- Frederik Peeters
  - Digestive Oncology, Department of Gastroenterology, University Hospitals Leuven, Leuven, Belgium
  - Laboratory of Clinical Digestive Oncology, Department of Oncology, KU Leuven, Leuven, Belgium
  - Laboratory for Translational Genetics, Department of Human Genetics, KU Leuven, Leuven, Belgium
  - VIB Centre for Cancer Biology, Leuven, Belgium
- Sarah Cappuyns
  - Digestive Oncology, Department of Gastroenterology, University Hospitals Leuven, Leuven, Belgium
  - Laboratory of Clinical Digestive Oncology, Department of Oncology, KU Leuven, Leuven, Belgium
  - Laboratory for Translational Genetics, Department of Human Genetics, KU Leuven, Leuven, Belgium
  - VIB Centre for Cancer Biology, Leuven, Belgium
- Marta Piqué-Gili
  - Liver Cancer Translational Research Laboratory, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Hospital Clínic, Universitat de Barcelona, Barcelona, Catalonia, Spain
- Gino Phillips
  - Laboratory for Translational Genetics, Department of Human Genetics, KU Leuven, Leuven, Belgium
  - VIB Centre for Cancer Biology, Leuven, Belgium
- Chris Verslype
  - Digestive Oncology, Department of Gastroenterology, University Hospitals Leuven, Leuven, Belgium
  - Laboratory of Clinical Digestive Oncology, Department of Oncology, KU Leuven, Leuven, Belgium
- Diether Lambrechts
  - Laboratory for Translational Genetics, Department of Human Genetics, KU Leuven, Leuven, Belgium
  - VIB Centre for Cancer Biology, Leuven, Belgium
- Jeroen Dekervel
  - Digestive Oncology, Department of Gastroenterology, University Hospitals Leuven, Leuven, Belgium
  - Laboratory of Clinical Digestive Oncology, Department of Oncology, KU Leuven, Leuven, Belgium
9
Nassiri I, Kwok AJ, Bhandari A, Bull KR, Garner LC, Klenerman P, Webber C, Parkkinen L, Lee AW, Wu Y, Fairfax B, Knight JC, Buck D, Piazza P. Demultiplexing of single-cell RNA-sequencing data using interindividual variation in gene expression. Bioinformatics Advances 2024; 4:vbae085. [PMID: 38911824 PMCID: PMC11193101 DOI: 10.1093/bioadv/vbae085] [Received: 05/11/2024] [Accepted: 06/07/2024] [Indexed: 06/25/2024]
Abstract
Motivation: Pooled designs for single-cell RNA sequencing, where many cells from distinct samples are processed jointly, offer increased throughput and reduced batch variation. This study describes expression-aware demultiplexing (EAD), a computational method that employs differential co-expression patterns between individuals to demultiplex pooled samples without any extra experimental steps. Results: We use synthetic sample pools and show that the top interindividual differentially co-expressed genes provide a distinct cluster of cells per individual, significantly enriching the regulation of metabolism. Our application of EAD to samples of six isogenic inbred mice demonstrated that controlling genetic and environmental effects can solve interindividual variations related to metabolic pathways. We utilized 30 samples from both sepsis and healthy individuals in six batches to assess the performance of classification approaches. The results indicate that combining genetic and EAD results can enhance the accuracy of assignments (Min. 0.94, Mean 0.98, Max. 1). The results were enhanced by an average of 1.4% when EAD and barcoding techniques were combined (Min. 1.25%, Median 1.33%, Max. 1.74%). Furthermore, we demonstrate that interindividual differential co-expression analysis within the same cell type can be used to identify cells from the same donor in different activation states. By analysing single-nuclei transcriptome profiles from the brain, we demonstrate that our method can be applied to nonimmune cells. Availability and implementation: EAD workflow is available at https://isarnassiri.github.io/scDIV/ as an R package called scDIV (acronym for single-cell RNA-sequencing data demultiplexing using interindividual variations).
Affiliation(s)
- Isar Nassiri
  - Nuffield Department of Medicine, Centre for Human Genetics, Oxford-GSK Institute of Molecular and Computational Medicine (IMCM), University of Oxford, Oxford, OX3 7BN, United Kingdom
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
  - Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, United Kingdom
  - Department of Psychiatry, University of Oxford, Oxford, OX3 7JX, United Kingdom
- Andrew J Kwok
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
  - Department of Medicine and Therapeutics, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Aneesha Bhandari
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
- Katherine R Bull
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
- Lucy C Garner
  - Translational Gastroenterology Unit, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 9DU, United Kingdom
- Paul Klenerman
  - Translational Gastroenterology Unit, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 9DU, United Kingdom
  - Peter Medawar Building for Pathogen Research, University of Oxford, Oxford, OX1 3SY, United Kingdom
  - NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford, OX3 9DU, United Kingdom
- Caleb Webber
  - Department of Physiology, Anatomy, Genetics, Oxford Parkinson’s Disease Centre, University of Oxford, Oxford, OX1 3PT, United Kingdom
  - UK Dementia Research Institute, Cardiff University, Cardiff, CF24 4HQ, United Kingdom
- Laura Parkkinen
  - Nuffield Department of Medicine, Centre for Human Genetics, Oxford-GSK Institute of Molecular and Computational Medicine (IMCM), University of Oxford, Oxford, OX3 7BN, United Kingdom
  - Nuffield Department of Clinical Neurosciences, Oxford Parkinson’s Disease Centre, University of Oxford, Oxford, OX3 9DU, United Kingdom
- Angela W Lee
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
- Yanxia Wu
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
- Benjamin Fairfax
  - MRC–Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, United Kingdom
  - Department of Oncology, University of Oxford & Oxford Cancer Centre, Churchill Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 7DQ, United Kingdom
- Julian C Knight
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
  - Chinese Academy of Medical Science Oxford Institute, University of Oxford, Oxford, OX3 7BN, United Kingdom
- David Buck
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
- Paolo Piazza
  - Nuffield Department of Medicine, Centre for Human Genetics, Oxford-GSK Institute of Molecular and Computational Medicine (IMCM), University of Oxford, Oxford, OX3 7BN, United Kingdom
  - Nuffield Department of Medicine, Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
| |
Collapse
10. Wagle MM, Long S, Chen C, Liu C, Yang P. Interpretable deep learning in single-cell omics. Bioinformatics 2024; 40:btae374. [PMID: 38889275 PMCID: PMC11211213 DOI: 10.1093/bioinformatics/btae374] [Received: 02/25/2024] [Revised: 05/11/2024] [Accepted: 06/12/2024] [Indexed: 06/20/2024]
Abstract
MOTIVATION: Single-cell omics technologies have enabled the quantification of molecular profiles in individual cells at an unparalleled resolution. Deep learning, a rapidly evolving sub-field of machine learning, has instilled a significant interest in single-cell omics research due to its remarkable success in analysing heterogeneous high-dimensional single-cell omics data. Nevertheless, the inherent multi-layer nonlinear architecture of deep learning models often makes them 'black boxes' as the reasoning behind predictions is often unknown and not transparent to the user. This has stimulated an increasing body of research for addressing the lack of interpretability in deep learning models, especially in single-cell omics data analyses, where the identification and understanding of molecular regulators are crucial for interpreting model predictions and directing downstream experimental validations.
RESULTS: In this work, we introduce the basics of single-cell omics technologies and the concept of interpretable deep learning. This is followed by a review of the recent interpretable deep learning models applied to various single-cell omics research. Lastly, we highlight the current limitations and discuss potential future directions.
Affiliation(s)
- Manoj M Wagle
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Siqu Long
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Carissa Chen
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Chunlei Liu
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Pengyi Yang
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Camperdown, NSW 2006, Australia
11. Zhang W, Yu R, Xu Z, Li J, Gao W, Jiang M, Dai Q. scCompressSA: dual-channel self-attention based deep autoencoder model for single-cell clustering by compressing gene-gene interactions. BMC Genomics 2024; 25:423. [PMID: 38684946 PMCID: PMC11059774 DOI: 10.1186/s12864-024-10286-2] [Received: 11/28/2023] [Accepted: 04/04/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND: Single-cell clustering has played an important role in exploring the molecular mechanisms of cell differentiation and human diseases. Because transcriptomics data are highly stochastic, accurate detection of cell types remains challenging, especially for RNA-sequencing data from human beings. In this case, deep neural networks have been increasingly employed to mine cell type specific patterns and have outperformed statistical approaches in cell clustering.
RESULTS: Using cross-correlation to capture gene-gene interactions, this study proposes the scCompressSA method to integrate topological patterns from scRNA-seq data, with support of a self-attention (SA) based coefficient compression (CC) block. This SA-based CC block is able to extract and employ static gene-gene interactions from scRNA-seq data. The proposed scCompressSA method has enhanced clustering accuracy in multiple benchmark scRNA-seq datasets by integrating topological and temporal features.
CONCLUSION: Static gene-gene interactions have been extracted as temporal features to boost clustering performance in single-cell clustering. For the scCompressSA method, the dual-channel SA based CC block is able to integrate topological features and has exhibited extraordinary detection accuracy compared with previous clustering approaches that only employ temporal patterns.
Affiliation(s)
- Wei Zhang
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Ruochen Yu
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Zeqi Xu
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Junnan Li
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Wenhao Gao
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Mingfeng Jiang
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Qi Dai
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
12. Ranek JS, Stallaert W, Milner JJ, Redick M, Wolff SC, Beltran AS, Stanley N, Purvis JE. DELVE: feature selection for preserving biological trajectories in single-cell data. Nat Commun 2024; 15:2765. [PMID: 38553455 PMCID: PMC10980758 DOI: 10.1038/s41467-024-46773-z] [Received: 05/18/2023] [Accepted: 03/07/2024] [Indexed: 04/02/2024]
Abstract
Single-cell technologies can measure the expression of thousands of molecular features in individual cells undergoing dynamic biological processes. While examining cells along a computationally-ordered pseudotime trajectory can reveal how changes in gene or protein expression impact cell fate, identifying such dynamic features is challenging due to the inherent noise in single-cell data. Here, we present DELVE, an unsupervised feature selection method for identifying a representative subset of molecular features which robustly recapitulate cellular trajectories. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effects of confounding sources of variation, and instead models cell states from dynamic gene or protein modules based on core regulatory complexes. Using simulations, single-cell RNA sequencing, and iterative immunofluorescence imaging data in the context of cell cycle and cellular differentiation, we demonstrate how DELVE selects features that better define cell-types and cell-type transitions. DELVE is available as an open-source python package: https://github.com/jranek/delve.
Affiliation(s)
- Jolene S Ranek
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Wayne Stallaert
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- J Justin Milner
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, USA
- Margaret Redick
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Samuel C Wolff
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Adriana S Beltran
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Human Pluripotent Cell Core, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, USA
- Natalie Stanley
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jeremy E Purvis
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
13. Gregory W, Sarwar N, Kevrekidis G, Villar S, Dumitrascu B. MarkerMap: nonlinear marker selection for single-cell studies. NPJ Syst Biol Appl 2024; 10:17. [PMID: 38351188 PMCID: PMC10864304 DOI: 10.1038/s41540-024-00339-3] [Received: 04/26/2023] [Accepted: 01/17/2024] [Indexed: 02/16/2024]
Abstract
Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features explaining this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets which are maximally informative of cell type origin and enable whole transcriptome reconstruction. MarkerMap provides a scalable framework for both supervised marker selection, aimed at identifying specific cell type populations, and unsupervised marker selection, aimed at gene expression imputation and reconstruction. We benchmark MarkerMap's competitive performance against previously published approaches on real single cell gene expression data sets. MarkerMap is available as a pip installable package, as a community resource aimed at developing explainable machine learning techniques for enhancing interpretability in single-cell studies.
Affiliation(s)
- Wilson Gregory
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Nabeel Sarwar
  - Center for Data Science, New York University, New York, NY, 10012, USA
- George Kevrekidis
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Soledad Villar
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
  - Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Bianca Dumitrascu
  - Department of Statistics, Columbia University, New York, NY, 10027, USA
  - Irving Institute for Cancer Dynamics, Columbia University, New York, NY, 10027, USA
14. Lin Y, Wu TY, Chen X, Wan S, Chao B, Xin J, Yang JYH, Wong WH, Wang YXR. Data integration and inference of gene regulation using single-cell temporal multimodal data with scTIE. Genome Res 2024; 34:119-133. [PMID: 38190633 PMCID: PMC10903952 DOI: 10.1101/gr.277960.123] [Received: 04/06/2023] [Accepted: 12/13/2023] [Indexed: 01/10/2024]
Abstract
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space by using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal data sets, we show scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome data set we generated from differentiating mouse embryonic stem cells over time, we show scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
Affiliation(s)
- Yingxin Lin
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR 999077, China
- Tung-Yu Wu
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
- Xi Chen
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
- Sheng Wan
  - Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
- Brian Chao
  - Department of Electrical Engineering, Stanford University, Stanford, California 94305-9505, USA
- Jingxue Xin
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
- Jean Y H Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR 999077, China
- Wing H Wong
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, California 94305-5464, USA
  - Bio-X Program, Stanford University, Stanford, California 94305, USA
- Y X Rachel Wang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
15. Tyler SR, Lozano-Ojalvo D, Guccione E, Schadt EE. Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq. Nat Commun 2024; 15:699. [PMID: 38267438 PMCID: PMC10808220 DOI: 10.1038/s41467-023-43406-9] [Received: 12/13/2022] [Accepted: 11/07/2023] [Indexed: 01/26/2024]
Abstract
While sub-clustering cell-populations has become popular in single cell-omics, negative controls for this process are lacking. Popular feature-selection/clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogenous clusters until nearly each cell is called its own cluster. Using real and synthetic datasets, we find that anti-correlated gene selection reduces or eliminates erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.
Affiliation(s)
- Scott R Tyler
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Daniel Lozano-Ojalvo
  - Department of Dermatology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ernesto Guccione
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Center for Therapeutics Discovery, Department of Oncological Sciences and Pharmacological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Bioinformatics for Next Generation Sequencing (BiNGS) Shared Resource Facility, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eric E Schadt
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
16. Wei Z, Chenjun W, Feiyang X, Mingfeng J, Yixuan Z, Qi L, Zhuoxing S, Qi D. scHybridBERT: integrating gene regulation and cell graph for spatiotemporal dynamics in single-cell clustering. Brief Bioinform 2024; 25:bbae018. [PMID: 38517692 PMCID: PMC10959234 DOI: 10.1093/bib/bbae018] [Received: 08/14/2023] [Revised: 12/19/2023] [Accepted: 01/09/2024] [Indexed: 03/24/2024]
Abstract
Graph learning models have received increasing attention in the computational analysis of single-cell RNA sequencing (scRNA-seq) data. Compared with conventional deep neural networks, graph neural networks and language models have exhibited superior performance by extracting graph-structured data from raw gene count matrices. Established deep neural network-based clustering approaches generally focus on temporal expression patterns while ignoring inherent interactions at gene-level as well as cell-level, which could be regarded as spatial dynamics in single-cell data. Both gene-gene and cell-cell interactions are able to boost the performance of cell type detection, under the framework of multi-view modeling. In this study, spatiotemporal embedding and cell graphs are extracted to capture spatial dynamics at the molecular level. In order to enhance the accuracy of cell type detection, this study proposes the scHybridBERT architecture to conduct multi-view modeling of scRNA-seq data using extracted spatiotemporal patterns. In this scHybridBERT method, graph learning models are employed to deal with cell graphs and the Performer model employs spatiotemporal embeddings. Experimental outcomes about benchmark scRNA-seq datasets indicate that the proposed scHybridBERT method is able to enhance the accuracy of single-cell clustering tasks by integrating spatiotemporal embeddings and cell graphs.
Affiliation(s)
- Zhang Wei
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
- Wu Chenjun
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
- Xing Feiyang
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, 200092, Shanghai, China
- Zhang Yixuan
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
- Liu Qi
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, 200092, Shanghai, China
- Shi Zhuoxing
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, 510060, Guangzhou, China
- Dai Qi
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
17. Kedziora KM, Stallaert W. Cell Cycle Mapping Using Multiplexed Immunofluorescence. Methods Mol Biol 2024; 2740:243-262. [PMID: 38393480 DOI: 10.1007/978-1-0716-3557-5_15] [Indexed: 02/25/2024]
Abstract
The development of technologies that allow measurement of the cell cycle at the single-cell level has revealed novel insights into the mechanisms that regulate cell cycle commitment and progression through DNA replication and cell division. These studies have also provided evidence of heterogeneity in cell cycle regulation among individual cells, even within a genetically identical population. Cell cycle mapping combines highly multiplexed imaging with manifold learning to visualize the diversity of "paths" that cells can take through the proliferative cell cycle or into various states of cell cycle arrest. In this chapter, we describe a general protocol of the experimental and computational components of cell cycle mapping. We also provide a comprehensive guide for the design and analysis of experiments, discussing key considerations in detail (e.g., antibody library preparation, analysis strategies, etc.) that may vary depending on the research question being addressed.
Affiliation(s)
- Katarzyna M Kedziora
  - Department of Cell Biology, Center for Biologic Imaging (CBI), University of Pittsburgh, Pittsburgh, PA, USA
- Wayne Stallaert
  - Department of Computational and Systems Biology, UPMC Hillman Cancer Center, University of Pittsburgh, Pittsburgh, PA, USA
18. Wang Z, Xie X, Liu S, Ji Z. scFseCluster: a feature selection-enhanced clustering for single-cell RNA-seq data. Life Sci Alliance 2023; 6:e202302103. [PMID: 37788907 PMCID: PMC10547911 DOI: 10.26508/lsa.202302103] [Received: 04/20/2023] [Revised: 09/21/2023] [Accepted: 09/22/2023] [Indexed: 10/05/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) enables researchers to reveal previously unknown cell heterogeneity and functional diversity, which is impossible with bulk RNA sequencing. Clustering approaches are widely used for analyzing scRNA-seq data and identifying cell types and states. In the past few years, various advanced computational strategies emerged. However, the low generalization and high computational cost are the main bottlenecks of existing methods. In this study, we established a novel computational framework, scFseCluster, for scRNA-seq clustering analysis. scFseCluster incorporates a metaheuristic algorithm (Feature Selection based on Quantum Squirrel Search Algorithm) to extract the optimal gene set, which largely guarantees the performance of cell clustering. We conducted simulation experiments in several aspects to verify the performance of the proposed approach. scFseCluster performed very well on eight benchmark scRNA-seq datasets because of the optimal gene sets obtained using the Feature Selection based on Quantum Squirrel Search Algorithm. The comparative study demonstrated the significant advantages of scFseCluster over seven State-of-the-Art algorithms. In addition, our analysis shows that feature selection on high-variable genes can significantly improve clustering performance. In conclusion, our study demonstrates that scFseCluster is a highly versatile tool for enhancing scRNA-seq data clustering analysis.
Affiliation(s)
- Zongqin Wang
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- Xiaojun Xie
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
  - Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
- Shouyang Liu
  - Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
- Zhiwei Ji
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
  - Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
19. Huang H, Liu C, Wagle MM, Yang P. Evaluation of deep learning-based feature selection for single-cell RNA sequencing data analysis. Genome Biol 2023; 24:259. [PMID: 37950331 PMCID: PMC10638755 DOI: 10.1186/s13059-023-03100-x] [Received: 12/09/2022] [Accepted: 10/24/2023] [Indexed: 11/12/2023]
Abstract
BACKGROUND: Feature selection is an essential task in single-cell RNA-seq (scRNA-seq) data analysis and can be critical for gene dimension reduction and downstream analyses, such as gene marker identification and cell type classification. Most popular methods for feature selection from scRNA-seq data are based on the concept of differential distribution wherein a statistical model is used to detect changes in gene expression among cell types. Recent development of deep learning-based feature selection methods provides an alternative approach compared to traditional differential distribution-based methods in that the importance of a gene is determined by neural networks.
RESULTS: In this work, we explore the utility of various deep learning-based feature selection methods for scRNA-seq data analysis. We sample from Tabula Muris and Tabula Sapiens atlases to create scRNA-seq datasets with a range of data properties and evaluate the performance of traditional and deep learning-based feature selection methods for cell type classification, feature selection reproducibility and diversity, and computational time.
CONCLUSIONS: Our study provides a reference for future development and application of deep learning-based feature selection methods for single-cell omics data analyses.
Affiliation(s)
- Hao Huang
  - Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- Chunlei Liu
  - Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- Manoj M Wagle
  - Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- Pengyi Yang
  - Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
  - Charles Perkins Centre, University of Sydney, Camperdown, NSW, 2006, Australia
20. Ng GYL, Tan SC, Ong CS. On the use of QDE-SVM for gene feature selection and cell type classification from scRNA-seq data. PLoS One 2023; 18:e0292961. [PMID: 37856458 PMCID: PMC10586655 DOI: 10.1371/journal.pone.0292961] [Received: 04/10/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023]
Abstract
Cell type identification is one of the fundamental tasks in single-cell RNA sequencing (scRNA-seq) studies. It is a key step to facilitate downstream interpretations such as differential expression, trajectory inference, etc. scRNA-seq data contains technical variations that could affect the interpretation of the cell types. Therefore, gene selection, also known as feature selection in data science, plays an important role in selecting informative genes for scRNA-seq cell type identification. Generally speaking, feature selection methods are categorized into filter-, wrapper-, and embedded-based approaches. From the existing literature, methods from filter- and embedded-based approaches are widely applied in scRNA-seq gene selection tasks. The wrapper-based method that gives promising results in other fields has yet to be extensively utilized for selecting gene features from scRNA-seq data; in addition, most of the existing wrapper methods used in this field are clustering instead of classification-based. With a large number of annotated data available today, this study applied a classification-based approach as an alternative to the clustering-based wrapper method. In our work, a quantum-inspired differential evolution (QDE) wrapped with a classification method was introduced to select a subset of genes from twelve well-known scRNA-seq transcriptomic datasets to identify cell types. In particular, the QDE was combined with different machine-learning (ML) classifiers namely logistic regression, decision tree, support vector machine (SVM) with linear and radial basis function kernels, as well as extreme learning machine. The linear SVM wrapped with QDE, namely QDE-SVM, was chosen by referring to the feature selection results from the experiment. QDE-SVM showed a superior cell type classification performance among QDE wrapping with other ML classifiers as well as the recent wrapper methods (i.e., FSCAM, SSD-LAHC, MA-HS, and BSF). QDE-SVM achieved an average accuracy of 0.9559, while the other wrapper methods achieved average accuracies in the range of 0.8292 to 0.8872.
Affiliation(s)
- Grace Yee Lin Ng
  - Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
- Shing Chiang Tan
  - Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
- Chia Sui Ong
  - Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
|
21
|
Kim D, Tran A, Kim HJ, Lin Y, Yang JYH, Yang P. Gene regulatory network reconstruction: harnessing the power of single-cell multi-omic data. NPJ Syst Biol Appl 2023; 9:51. [PMID: 37857632 PMCID: PMC10587078 DOI: 10.1038/s41540-023-00312-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 10/02/2023] [Indexed: 10/21/2023] Open
Abstract
Inferring gene regulatory networks (GRNs) is a fundamental challenge in biology that aims to unravel the complex relationships between genes and their regulators. Deciphering these networks plays a critical role in understanding the underlying regulatory crosstalk that drives many cellular processes and diseases. Recent advances in sequencing technology have led to the development of state-of-the-art GRN inference methods that exploit matched single-cell multi-omic data. By employing diverse mathematical and statistical methodologies, these methods aim to reconstruct more comprehensive and precise gene regulatory networks. In this review, we give a brief overview on the statistical and methodological foundations commonly used in GRN inference methods. We then compare and contrast the latest state-of-the-art GRN inference methods for single-cell matched multi-omics data, and discuss their assumptions, limitations and opportunities. Finally, we discuss the challenges and future directions that hold promise for further advancements in this rapidly developing field.
Affiliation(s)
- Daniel Kim
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
  - Computational Systems Biology Unit, Children's Medical Research Institute, University of Sydney, Camperdown, NSW, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
- Andy Tran
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
  - Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
- Hani Jieun Kim
  - Computational Systems Biology Unit, Children's Medical Research Institute, University of Sydney, Camperdown, NSW, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
- Yingxin Lin
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
  - Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
- Jean Yee Hwa Yang
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
  - Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
- Pengyi Yang
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
  - Computational Systems Biology Unit, Children's Medical Research Institute, University of Sydney, Camperdown, NSW, Australia
  - Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
  - Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
22
O'Connor LM, O'Connor BA, Lim SB, Zeng J, Lo CH. Integrative multi-omics and systems bioinformatics in translational neuroscience: A data mining perspective. J Pharm Anal 2023; 13:836-850. [PMID: 37719197 PMCID: PMC10499660 DOI: 10.1016/j.jpha.2023.06.011] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 11/29/2022] [Revised: 06/20/2023] [Accepted: 06/25/2023] [Indexed: 09/19/2023] Open
Abstract
Bioinformatic analysis of large and complex omics datasets has become increasingly useful in modern day biology by providing a great depth of information, with its application to neuroscience termed neuroinformatics. Data mining of omics datasets has enabled the generation of new hypotheses based on differentially regulated biological molecules associated with disease mechanisms, which can be tested experimentally for improved diagnostic and therapeutic targeting of neurodegenerative diseases. Importantly, integrating multi-omics data using a systems bioinformatics approach will advance the understanding of the layered and interactive network of biological regulation that exchanges systemic knowledge to facilitate the development of a comprehensive human brain profile. In this review, we first summarize data mining studies utilizing datasets from the individual type of omics analysis, including epigenetics/epigenomics, transcriptomics, proteomics, metabolomics, lipidomics, and spatial omics, pertaining to Alzheimer's disease, Parkinson's disease, and multiple sclerosis. We then discuss multi-omics integration approaches, including independent biological integration and unsupervised integration methods, for more intuitive and informative interpretation of the biological data obtained across different omics layers. We further assess studies that integrate multi-omics in data mining which provide convoluted biological insights and offer proof-of-concept proposition towards systems bioinformatics in the reconstruction of brain networks. Finally, we recommend a combination of high dimensional bioinformatics analysis with experimental validation to achieve translational neuroscience applications including biomarker discovery, therapeutic development, and elucidation of disease mechanisms. We conclude by providing future perspectives and opportunities in applying integrative multi-omics and systems bioinformatics to achieve precision phenotyping of neurodegenerative diseases and towards personalized medicine.
Affiliation(s)
- Lance M. O'Connor
  - College of Biological Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
- Blake A. O'Connor
  - School of Pharmacy, University of Wisconsin, Madison, WI, 53705, USA
- Su Bin Lim
  - Department of Biochemistry and Molecular Biology, Ajou University School of Medicine, Suwon, 16499, South Korea
- Jialiu Zeng
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
- Chih Hung Lo
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
23
Ferguson C, Zhang Y, Palego C, Cheng X. Recent Approaches to Design and Analysis of Electrical Impedance Systems for Single Cells Using Machine Learning. Sensors (Basel) 2023; 23:5990. [PMID: 37447838 DOI: 10.3390/s23135990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 06/17/2023] [Accepted: 06/26/2023] [Indexed: 07/15/2023]
Abstract
Individual cells have many unique properties that can be quantified to develop a holistic understanding of a population. This can include understanding population characteristics, identifying subpopulations, or elucidating outlier characteristics that may be indicators of disease. Electrical impedance measurements are rapid and label-free for the monitoring of single cells and generate large datasets of many cells at single or multiple frequencies. To increase the accuracy and sensitivity of measurements and define the relationships between impedance and biological features, many electrical measurement systems have incorporated machine learning (ML) paradigms for control and analysis. Considering the difficulty capturing complex relationships using traditional modelling and statistical methods due to population heterogeneity, ML offers an exciting approach to the systemic collection and analysis of electrical properties in a data-driven way. In this work, we discuss incorporation of ML to improve the field of electrical single cell analysis by addressing the design challenges to manipulate single cells and sophisticated analysis of electrical properties that distinguish cellular changes. Looking forward, we emphasize the opportunity to build on integrated systems to address common challenges in data quality and generalizability to save time and resources at every step in electrical measurement of single cells.
Affiliation(s)
- Caroline Ferguson
  - Department of Bioengineering, Lehigh University, Bethlehem, PA 18015, USA
- Yu Zhang
  - Department of Bioengineering, Lehigh University, Bethlehem, PA 18015, USA
- Cristiano Palego
  - Department of Computer Science and Electronic Engineering, Bangor University, Bangor LL57 2DG, UK
- Xuanhong Cheng
  - Department of Bioengineering, Lehigh University, Bethlehem, PA 18015, USA
  - Department of Materials Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
24
Yu L, Liu C, Yang JYH, Yang P. Ensemble deep learning of embeddings for clustering multimodal single-cell omics data. Bioinformatics 2023; 39:btad382. [PMID: 37314966 PMCID: PMC10287920 DOI: 10.1093/bioinformatics/btad382] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/13/2023] [Revised: 04/16/2023] [Accepted: 06/12/2023] [Indexed: 06/16/2023] Open
Abstract
MOTIVATION: Recent advances in multimodal single-cell omics technologies enable multiple modalities of molecular attributes, such as gene expression, chromatin accessibility, and protein abundance, to be profiled simultaneously at a global level in individual cells. While the increasing availability of multiple data modalities is expected to provide a more accurate clustering and characterization of cells, the development of computational methods that are capable of extracting information embedded across data modalities is still in its infancy.
RESULTS: We propose SnapCCESS for clustering cells by integrating data modalities in multimodal single-cell omics data using an unsupervised ensemble deep learning framework. By creating snapshots of embeddings of multimodality using variational autoencoders, SnapCCESS can be coupled with various clustering algorithms for generating consensus clustering of cells. We applied SnapCCESS with several clustering algorithms to various datasets generated from popular multimodal single-cell omics technologies. Our results demonstrate that SnapCCESS is effective and more efficient than conventional ensemble deep learning-based clustering methods and outperforms other state-of-the-art multimodal embedding generation methods in integrating data modalities for clustering cells. The improved clustering of cells from SnapCCESS will pave the way for more accurate characterization of cell identity and types, an essential step for various downstream analyses of multimodal single-cell omics data.
AVAILABILITY AND IMPLEMENTATION: SnapCCESS is implemented as a Python package and is freely available from https://github.com/PYangLab/SnapCCESS under the open-source license of GPL-3. The data used in this study are publicly available (see section 'Data availability').
Affiliation(s)
- Lijia Yu
  - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Chunlei Liu
  - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Jean Yee Hwa Yang
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D4H), Hong Kong Science Park, Hong Kong SAR, China
- Pengyi Yang
  - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D4H), Hong Kong Science Park, Hong Kong SAR, China
25
Lin Y, Wu TY, Chen X, Wan S, Chao B, Xin J, Yang JY, Wong WH, Wang YXR. scTIE: data integration and inference of gene regulation using single-cell temporal multimodal data. bioRxiv 2023:2023.05.18.541381. [PMID: 37292801 PMCID: PMC10245711 DOI: 10.1101/2023.05.18.541381] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal datasets, we demonstrate scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome dataset we generated from differentiating mouse embryonic stem cells over time, we demonstrate scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
Affiliation(s)
- Yingxin Lin
  - School of Mathematics and Statistics, The University of Sydney, NSW, Australia
  - Charles Perkins Centre, The University of Sydney, NSW, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Tung-Yu Wu
  - Department of Statistics, Stanford University, CA, USA
- Xi Chen
  - Department of Statistics, Stanford University, CA, USA
- Sheng Wan
  - Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Brian Chao
  - Department of Electrical Engineering, Stanford University, CA, USA
- Jingxue Xin
  - Department of Statistics, Stanford University, CA, USA
- Jean Y.H. Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW, Australia
  - Charles Perkins Centre, The University of Sydney, NSW, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Wing H. Wong
  - Department of Statistics, Stanford University, CA, USA
  - Department of Biomedical Data Science, Stanford University, CA, USA
  - Bio-X Program, Stanford University, CA, USA
- Y. X. Rachel Wang
  - School of Mathematics and Statistics, The University of Sydney, NSW, Australia
26
Ranek JS, Stallaert W, Milner J, Stanley N, Purvis JE. Feature selection for preserving biological trajectories in single-cell data. bioRxiv 2023:2023.05.09.540043. [PMID: 37214963 PMCID: PMC10197710 DOI: 10.1101/2023.05.09.540043] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/24/2023]
Abstract
Single-cell technologies can readily measure the expression of thousands of molecular features from individual cells undergoing dynamic biological processes, such as cellular differentiation, immune response, and disease progression. While examining cells along a computationally ordered pseudotime offers the potential to study how subtle changes in gene or protein expression impact cell fate decision-making, the characteristic features that drive continuous biological processes remain difficult to identify from unenriched and noisy single-cell data. Given that all profiled sources of feature variation contribute to the cell-to-cell distances that define an inferred cellular trajectory, including confounding sources of biological variation (e.g. cell cycle or metabolic state) or noisy and irrelevant features (e.g. measurements with low signal-to-noise ratio) can mask the underlying trajectory of study and hinder inference. Here, we present DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that recapitulates cellular trajectories. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes. Using simulations, single-cell RNA sequencing data, and iterative immunofluorescence imaging data in the context of the cell cycle and cellular differentiation, we demonstrate that DELVE selects features that more accurately characterize cell populations and improve the recovery of cell type transitions. This feature selection framework provides an alternative approach for improving trajectory inference and uncovering co-variation amongst features along a biological trajectory. DELVE is implemented as an open-source Python package and is publicly available at: https://github.com/jranek/delve.
Affiliation(s)
- Jolene S. Ranek
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Wayne Stallaert
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Justin Milner
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Natalie Stanley
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jeremy E. Purvis
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
27
Cao Y, Ghazanfar S, Yang P, Yang J. Benchmarking of analytical combinations for COVID-19 outcome prediction using single-cell RNA sequencing data. Brief Bioinform 2023; 24:7140296. [PMID: 37096588 DOI: 10.1093/bib/bbad159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Revised: 03/30/2023] [Accepted: 04/03/2023] [Indexed: 04/26/2023] Open
Abstract
Advances in single-cell transcriptomic technologies have led to increasing use of single-cell RNA sequencing (scRNA-seq) data in large-scale patient cohort studies. The resulting high-dimensional data can be summarized and incorporated into patient outcome prediction models in several ways; however, there is a pressing need to understand the impact of analytical decisions on such model quality. In this study, we evaluate the impact of analytical choices, including model choices, ensemble learning strategies and integration approaches, on patient outcome prediction using five scRNA-seq COVID-19 datasets. First, we examine the difference in performance between using single-view feature space versus multi-view feature space. Next, we survey multiple learning platforms from classical machine learning to modern deep learning methods. Lastly, we compare different integration approaches when combining datasets is necessary. Through benchmarking such analytical combinations, our study highlights the power of ensemble learning, consistency among different learning methods and robustness to dataset normalization when using multiple datasets as the model input.
Affiliation(s)
- Yue Cao
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Shila Ghazanfar
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Pengyi Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, NSW 2145, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Jean Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
28
Deng T, Chen S, Zhang Y, Xu Y, Feng D, Wu H, Sun X. A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis. Brief Bioinform 2023; 24:bbad042. [PMID: 36754847 PMCID: PMC10025445 DOI: 10.1093/bib/bbad042] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/25/2022] [Revised: 12/05/2022] [Accepted: 01/18/2023] [Indexed: 02/10/2023] Open
Abstract
Feature gene selection has a significant impact on the performance of cell clustering in single-cell RNA sequencing (scRNA-seq) analysis. A well-rounded feature selection (FS) method should consider relevance, redundancy and complementarity of the features. Yet most existing FS methods focus on gene relevance to the cell types but neglect redundancy and complementarity, which undermines the cell clustering performance. We develop a novel computational method, GeneClust, to select feature genes for scRNA-seq cell clustering. GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. It can work as a plug-in tool for FS with any existing cell clustering method. Extensive benchmark results demonstrate that GeneClust significantly improves the clustering performance. Moreover, GeneClust can group cofunctional genes in biological processes and pathways into clusters, thus providing a means of investigating gene interactions and identifying potential genes relevant to biological characteristics of the dataset. GeneClust is freely available at https://github.com/ToryDeng/scGeneClust.
Affiliation(s)
- Tao Deng
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
- Siyu Chen
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Ying Zhang
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Yuanbin Xu
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Da Feng
  - School of Pharmacy, Tongji Medical College, Huazhong University of Sciences and Technology, Hubei, China
- Hao Wu
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, GA, USA
  - Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Xiaobo Sun
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
29
Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity. Genes (Basel) 2023; 14:genes14020248. [PMID: 36833178 PMCID: PMC9956296 DOI: 10.3390/genes14020248] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/15/2022] [Revised: 01/11/2023] [Accepted: 01/12/2023] [Indexed: 01/20/2023] Open
Abstract
The use of machine learning techniques for the construction of predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the last few years in the biomedical field. Nonetheless, the virtuosity of omics studies and machine learning tools is subject to the proper application of algorithms as well as the appropriate pre-processing and management of input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data. As such, a series of best practices and recommendations are also presented for each of the steps defined. In particular, the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for the study of disease development prediction using machine learning are described. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define the proposals for model improvement based on the results found, which serve as the bases for future work.
30
Kim HJ, O'Hara-Wright M, Kim D, Loi TH, Lim BY, Jamieson RV, Gonzalez-Cordero A, Yang P. Comprehensive characterization of fetal and mature retinal cell identity to assess the fidelity of retinal organoids. Stem Cell Reports 2023; 18:175-189. [PMID: 36630901 PMCID: PMC9860116 DOI: 10.1016/j.stemcr.2022.12.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 04/29/2022] [Revised: 12/07/2022] [Accepted: 12/07/2022] [Indexed: 01/12/2023] Open
Abstract
Characterizing cell identity in complex tissues such as the human retina is essential for studying its development and disease. While retinal organoids derived from pluripotent stem cells have been widely used to model development and disease of the human retina, there is a lack of studies that have systematically evaluated the molecular and cellular fidelity of the organoids derived from various culture protocols in recapitulating their in vivo counterpart. To this end, we performed an extensive meta-atlas characterization of cellular identities of the human eye, covering a wide range of developmental stages. The resulting map uncovered previously unknown biomarkers of major retinal cell types and those associated with cell-type-specific maturation. Using our retinal-cell-identity map from the fetal and adult tissues, we systematically assessed the fidelity of the retinal organoids in mimicking the human eye, enabling us to comprehensively benchmark the current protocols for retinal organoid generation.
Affiliation(s)
- Hani Jieun Kim
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia; School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia; School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- Michelle O'Hara-Wright
  - School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia; Stem Cell Medicine Group, Children's Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Daniel Kim
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia; School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- To Ha Loi
  - School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia; Eye Genetics Research Unit, Children's Medical Research Institute, Sydney Children's Hospitals Network, Save Sight Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Benjamin Y Lim
  - Stem Cell Medicine Group, Children's Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Robyn V Jamieson
  - Specialty of Genomic Medicine, Faculty of Medicine and Health, University of Sydney, Westmead, NSW 2145, Australia; Eye Genetics Research Unit, Children's Medical Research Institute, Sydney Children's Hospitals Network, Save Sight Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Anai Gonzalez-Cordero
  - School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia; Stem Cell Medicine Group, Children's Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia; School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia; School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
31
Paplomatas P, Vlamos P, Vrahatis AG. A Comparison of the Various Methods for Selecting Features for Single-Cell RNA Sequencing Data in Alzheimer's Disease. Adv Exp Med Biol 2023; 1424:241-246. [PMID: 37486500 DOI: 10.1007/978-3-031-31982-2_27] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2023]
Abstract
The high-throughput sequencing method known as RNA-Seq records the whole transcriptome of individual cells. Single-cell RNA sequencing, also known as scRNA-Seq, is widely utilized in the field of biomedical research and has resulted in the generation of huge quantities and types of data. The noise and artifacts that are present in the raw data require extensive cleaning before they can be used. When applied to applications for machine learning or pattern recognition, feature selection methods offer a method to reduce the amount of time spent on calculation while simultaneously improving predictions and offering a better knowledge of the data. The process of discovering biomarkers is analogous to feature selection methods used in machine learning and is especially helpful for applications in the medical field. An attempt is made by a feature selection algorithm to cut down on the total number of features by eliminating those that are unnecessary or redundant while retaining those that are the most helpful. We apply FS algorithms designed for scRNA-Seq to Alzheimer's disease, which is the most prevalent neurodegenerative disease in the western world and causes cognitive and behavioral impairment. AD is clinically and pathologically varied, and genetic studies imply a diversity of biological mechanisms and pathways. Over 20 new Alzheimer's disease susceptibility loci have been discovered through linkage, genome-wide association, and next-generation sequencing (Tosto G, Reitz C, Mol Cell Probes 30:397-403, 2016). In this study, we focus on the performance of three different approaches to marker gene selection and compare them using the support vector machine (SVM), the k-nearest neighbors algorithm (k-NN), and linear discriminant analysis (LDA), which are mainly supervised classification algorithms.
Affiliation(s)
- Petros Paplomatas
- Bioinformatics and Human Electrophysiology Lab (BiHELab), Department of Informatics, Ionian University, Corfu, Greece
| | - Panagiotis Vlamos
- Bioinformatics and Human Electrophysiology Lab (BiHELab), Department of Informatics, Ionian University, Corfu, Greece
| | - Aristidis G Vrahatis
- Bioinformatics and Human Electrophysiology Lab (BiHELab), Department of Informatics, Ionian University, Corfu, Greece
|
32
|
Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] Open
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Affiliation(s)
- Min Su
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Tao Pan
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Qiu-Zhen Chen
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Wei-Wei Zhou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
| | - Yi Gong
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Department of Immunology, Nanjing Medical University, Nanjing, 211166 China
| | - Gang Xu
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Huan-Yu Yan
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Si Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Qiao-Zhen Shi
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Ya Zhang
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Xiao He
- Department of Laboratory Medicine, Women and Children’s Hospital of Chongqing Medical University, Chongqing, 401174 China
| | | | - Shi-Cai Fan
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110 Guangdong China
| | - Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
| | - Murray J. Cairns
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW 2308 Australia
- Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW 2305 Australia
| | - Xi Wang
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Yong-Sheng Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
|
33
|
Cao Y, Lin Y, Patrick E, Yang P, Yang JYH. scFeatures: multi-view representations of single-cell and spatial data for disease outcome prediction. Bioinformatics 2022; 38:4745-4753. [PMID: 36040148 PMCID: PMC9563679 DOI: 10.1093/bioinformatics/btac590] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/12/2022] [Revised: 07/21/2022] [Accepted: 08/28/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: With the recent surge of large-cohort scale single cell research, it is of critical importance that analytical methods can fully utilize the comprehensive characterization of cellular systems that single cell technologies produce to provide insights into samples from individuals. Currently, there is little consensus on the best ways to compress information from the complex data structures of these technologies to summary statistics that represent each sample (e.g. individuals). RESULTS: Here, we present scFeatures, an approach that creates interpretable cellular and molecular representations of single-cell and spatial data at the sample level. We demonstrate that summarizing a broad collection of features at the sample level is both important for understanding underlying disease mechanisms in different experimental studies and for accurately classifying disease status of individuals. AVAILABILITY AND IMPLEMENTATION: scFeatures is publicly available as an R package at https://github.com/SydneyBioX/scFeatures. All data used in this study are publicly available with accession ID reported in the Section 2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yue Cao
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| | - Yingxin Lin
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| | - Ellis Patrick
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Computational Systems Biology Group, Children’s Medical Research Institute, Westmead, NSW 2145, Australia
| | - Pengyi Yang
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Computational Systems Biology Group, Children’s Medical Research Institute, Westmead, NSW 2145, Australia
| | - Jean Yee Hwa Yang
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
|
34
|
Li R, Banjanin B, Schneider RK, Costa IG. Detection of cell markers from single cell RNA-seq with sc2marker. BMC Bioinformatics 2022; 23:276. [PMID: 35831796 PMCID: PMC9281170 DOI: 10.1186/s12859-022-04817-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/14/2022] [Accepted: 06/28/2022] [Indexed: 11/12/2022] Open
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) allows the detection of rare cell types in complex tissues. The detection of markers for rare cell types is useful for further biological analysis of, for example, flow cytometry and imaging data sets for either physical isolation or spatial characterization of these cells. However, only a few computational approaches consider the problem of selecting specific marker genes from scRNA-seq data. Results: Here, we propose sc2marker, which is based on the maximum margin index and a database of proteins with antibodies, to select markers for flow cytometry or imaging. We evaluated the performances of sc2marker and competing methods in ranking known markers in scRNA-seq data of immune and stromal cells. The results showed that sc2marker performed better than the competing methods in accuracy, while having a competitive running time. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04817-5.
Affiliation(s)
- Ronghui Li
- Joint Research Center for Computational Biomedicine, Institute for Computational Genomics, RWTH Aachen University, Aachen, Germany
| | - Bella Banjanin
- Department of Cell Biology, Institute for Biomedical Engineering, RWTH Aachen University, Aachen, Germany
| | - Rebekka K Schneider
- Department of Cell Biology, Institute for Biomedical Engineering, RWTH Aachen University, Aachen, Germany
| | - Ivan G Costa
- Joint Research Center for Computational Biomedicine, Institute for Computational Genomics, RWTH Aachen University, Aachen, Germany.
|
35
|
Zhong P, Wei X, Li X, Wei X, Wu S, Huang W, Koidis A, Xu Z, Lei H. Untargeted metabolomics by liquid chromatography‐mass spectrometry for food authentication: A review. Compr Rev Food Sci Food Saf 2022; 21:2455-2488. [DOI: 10.1111/1541-4337.12938] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/19/2021] [Revised: 02/20/2022] [Accepted: 02/21/2022] [Indexed: 12/17/2022]
Affiliation(s)
- Peng Zhong
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Xiaoqun Wei
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Xiangmei Li
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Xiaoyi Wei
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Shaozong Wu
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Weijuan Huang
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Anastasios Koidis
- Institute for Global Food Security Queen's University Belfast Belfast UK
| | - Zhenlin Xu
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
| | - Hongtao Lei
- Guangdong Provincial Key Laboratory of Food Quality and Safety / National–Local Joint Engineering Research Center for Precision Machining and Safety of Livestock and Poultry Products, College of Food Science South China Agricultural University Guangzhou 510642 China
- Guangdong Laboratory for Lingnan Modern Agriculture South China Agricultural University Guangzhou 510642 China
|
36
|
Zhou L, Wang H. A Combined Feature Screening Approach of Random Forest and Filter-based Methods for Ultra-high Dimensional Data. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220221120618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
Various feature (variable) screening approaches have been proposed in the past decade to mitigate the impact of ultra-high dimensionality in classification and regression problems, including filter-based methods such as sure independence screening, and wrapper-based methods such as random forest. However, the former type of methods relies heavily on strong modelling assumptions, while the latter requires an adequate sample size to make the data speak for themselves. These requirements can seldom be met in biochemical studies where we only have access to ultra-high dimensional data with a complex structure and a small number of observations.
Objective:
In this research, we investigate the possibility of combining filter-based screening methods and random forest based screening methods in the regression context.
Method:
We have combined four state-of-the-art filter approaches from the statistical community, namely sure independence screening (SIS), robust rank correlation based screening (RRCS), high dimensional ordinary least squares projection (HOLP) and a model-free sure independence screening procedure based on the distance correlation (DCSIS), with a random forest based Boruta screening method from the machine learning community for regression problems.
Result:
Among all combined methods, RF-DCSIS performs better than the other methods in terms of screening accuracy and prediction capability on the simulated scenarios and real benchmark datasets.
Conclusion:
Through empirical study on both extensive simulations and real data, we have shown that filter-based screening and random forest based screening each have their pros and cons, while a combination of both may lead to better feature screening results and prediction capability.
Keywords:
feature screening, filter-based method, ultra-high dimensional data, variable selection, random forest, RF-DCSIS
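As an illustration of the filter-style screening named in this abstract, the following is a minimal sketch of sure independence screening (SIS) in its basic form: rank each feature by the absolute value of its marginal Pearson correlation with the response and keep the top d. It is not the authors' implementation and does not cover RRCS, HOLP, DCSIS, or the Boruta step. The function names `pearson` and `sis_screen`, the toy data, and the choice d = 2 are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Return the (sorted) indices of the d features with the largest
    absolute marginal correlation with the response y.

    `columns` is a list of feature columns, each the same length as y.
    """
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Hypothetical toy data: 5 observations, 4 candidate features.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [
    [2.1, 3.9, 6.2, 8.1, 9.8],  # strongly positively related to y
    [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to y
    [9.0, 7.5, 6.1, 4.2, 2.0],  # strongly negatively related to y
    [1.0, 1.0, 1.0, 1.0, 1.0],  # constant, carries no information
]
selected = sis_screen(columns, y, d=2)
print(selected)
```

Because the ranking uses the absolute correlation, negatively related features are retained as readily as positively related ones. In a combined scheme such as the one the abstract describes, a screened subset like `selected` would then be passed to a second-stage method.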
Affiliation(s)
- Lifeng Zhou
- School of Economics and Management, Changsha University, China
| | - Hong Wang
- School of Mathematics and Statistics, Central South University, China
|
37
|
Graph Based Feature Selection for Reduction of Dimensionality in Next-Generation RNA Sequencing Datasets. Algorithms 2022. [DOI: 10.3390/a15010021] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/30/2022]
Abstract
Analysis of high-dimensional data, with more features (p) than observations (N) (p>N), places significant demands on computational cost and memory usage. Feature selection can be used to reduce the dimensionality of the data. We used a graph-based approach, principal component analysis (PCA) and recursive feature elimination to select features for classification from two lung cancer RNAseq datasets. The selected features were discretized for association rule mining where support and lift were used to generate informative rules. Our results show that the graph-based feature selection improved the performance of sequential minimal optimization (SMO) and multilayer perceptron classifiers (MLP) in both datasets. In association rule mining, features selected using the graph-based approach outperformed the other two feature-selection techniques at a support of 0.5 and lift of 2. The non-redundant rules reflect the inherent relationships between features. Biological features are usually related to functions in living systems, a relationship that cannot be deduced by feature selection and classification alone. Therefore, the graph-based feature-selection approach combined with rule mining is a suitable way of selecting and finding associations between features in high-dimensional RNAseq data.
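The support and lift measures mentioned in this abstract have standard definitions in association rule mining: the support of an itemset is the fraction of records containing it, and the lift of a rule A → B is support(A ∪ B) / (support(A) · support(B)). The sketch below computes both on a hypothetical toy set of discretized records. The item names, the data, and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent.

    Values above 1 mean the two itemsets co-occur more often than
    expected if they were independent. Returns 0.0 when either side
    never occurs, to avoid dividing by zero.
    """
    s_a = support(transactions, antecedent)
    s_b = support(transactions, consequent)
    if s_a == 0 or s_b == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (s_a * s_b)


# Hypothetical discretized expression records (one set of items per sample).
transactions = [
    {"geneA_high", "geneB_high"},
    {"geneA_high", "geneB_high"},
    {"geneC_high"},
    {"geneC_high"},
]

rule_support = support(transactions, {"geneA_high", "geneB_high"})
rule_lift = lift(transactions, {"geneA_high"}, {"geneB_high"})
print(rule_support, rule_lift)
```

In this toy data the rule geneA_high → geneB_high has support 0.5 and lift 2, which happens to sit at the thresholds quoted in the abstract; a rule miner would keep rules whose support and lift meet the chosen cutoffs.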
|