1
Guo Q, Yuan M, Zhang L, Deng M. scPLAN: a hierarchical computational framework for single transcriptomics data annotation, integration and cell-type label refinement. Brief Bioinform 2024; 25:bbae305. [PMID: 38935069 PMCID: PMC11209730 DOI: 10.1093/bib/bbae305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 05/22/2024] [Accepted: 06/11/2024] [Indexed: 06/28/2024] Open
Abstract
MOTIVATION In the past decade, single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal method for transcriptomic profiling in biomedical research. Precise cell-type identification is crucial for subsequent analysis of single-cell data, and the integration and refinement of annotated data are essential for building comprehensive databases. However, prevailing annotation techniques often overlook the hierarchical organization of cell types, resulting in inconsistent annotations. Meanwhile, most existing integration approaches fail to integrate datasets with different annotation depths, and none of them can enhance the labels of outdated data with lower annotation resolutions using more intricately annotated datasets or novel biological findings. RESULTS Here, we introduce scPLAN, a hierarchical computational framework designed for scRNA-seq data analysis. scPLAN excels in annotating unlabeled scRNA-seq data using a reference dataset structured along a hierarchical cell-type tree. It identifies potential novel cell types in a systematic, layer-by-layer manner. Additionally, scPLAN effectively integrates annotated scRNA-seq datasets with varying levels of annotation depth, ensuring consistent refinement of cell-type labels across datasets with lower resolutions. Through extensive annotation and novel cell detection experiments, scPLAN has demonstrated its efficacy. Two case studies showcase how scPLAN integrates datasets with diverse cell-type label resolutions and refines their cell-type labels. AVAILABILITY https://github.com/michaelGuo1204/scPLAN.
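The layer-by-layer idea in this abstract can be illustrated with a minimal sketch. It is not scPLAN's implementation: the tree, the `score` callable, the `threshold` value and the "novel" label format are all hypothetical, chosen only to show how a cell can be walked down a cell-type tree and flagged as potentially novel at the first layer where no child type fits.

```python
from typing import Callable, Dict, List

# Hypothetical cell-type tree (parent -> children); not taken from scPLAN.
TREE: Dict[str, List[str]] = {
    "cell": ["immune", "epithelial"],
    "immune": ["T cell", "B cell"],
    "T cell": ["CD4 T", "CD8 T"],
}


def annotate(score: Callable[[str], float], root: str = "cell",
             threshold: float = 0.5) -> str:
    """Descend the tree one layer at a time for a single cell.

    `score` is an assumed per-cell function returning a confidence for a
    candidate type. If no child reaches `threshold`, the cell is reported
    as a potential novel type under the last confident node.
    """
    node = root
    while node in TREE:                      # stop when a leaf is reached
        best = max(TREE[node], key=score)    # best-matching child type
        if score(best) < threshold:
            return f"novel ({node})"
        node = best
    return node


# Example: confident down to "T cell", but neither subtype fits well.
scores = {"immune": 0.9, "epithelial": 0.1, "T cell": 0.8,
          "B cell": 0.2, "CD4 T": 0.3, "CD8 T": 0.4}
label = annotate(lambda t: scores.get(t, 0.0))
```

Here the cell keeps the coarser label "T cell" and is marked as a candidate novel subtype rather than being forced into a leaf.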
Affiliation(s)
- Qirui Guo
  - Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China
- Musu Yuan
  - Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China
- Lei Zhang
  - Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China
  - Beijing International Center for Mathematical Research, Peking University, Yiheyuan Road, 100871, Beijing, China
  - Center for Machine Learning Research, Peking University, Yiheyuan Road, 100871, Beijing, China
- Minghua Deng
  - Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China
  - School of Mathematical Sciences, Peking University, Yiheyuan Road, 100871, Beijing, China
  - Center for Statistical Science, Peking University, Yiheyuan Road, 100871, Beijing, China
2
Ong Ly C, Unnikrishnan B, Tadic T, Patel T, Duhamel J, Kandel S, Moayedi Y, Brudno M, Hope A, Ross H, McIntosh C. Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data. NPJ Digit Med 2024; 7:124. [PMID: 38744921 PMCID: PMC11094145 DOI: 10.1038/s41746-024-01118-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Accepted: 04/23/2024] [Indexed: 05/16/2024] Open
Abstract
Healthcare datasets are becoming larger and more complex, necessitating the development of accurate and generalizable AI models for medical applications. Unstructured datasets, including medical imaging, electrocardiograms, and natural language data, are gaining attention with advancements in deep convolutional neural networks and large language models. However, estimating the generalizability of these models to new healthcare settings without extensive validation on external data remains challenging. In experiments across 13 datasets including X-rays, CTs, ECGs, clinical discharge summaries, and lung auscultation data, our results demonstrate that model performance is frequently overestimated by up to 20% on average due to shortcut learning of hidden data acquisition biases (DAB). Shortcut learning refers to a phenomenon in which an AI model learns to solve a task based on spurious correlations present in the data as opposed to features directly related to the task itself. We propose an open source, bias-corrected external accuracy estimate, PEst, that better estimates external accuracy to within 4% on average by measuring and calibrating for DAB-induced shortcut learning.
Affiliation(s)
- Cathy Ong Ly
  - Peter Munk Cardiac Centre and Ted Rogers Centre for Heart Research, University Health Network, Toronto, ON, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Toronto General Hospital Research Institute, University Health Network, Toronto, ON, Canada
- Balagopal Unnikrishnan
  - Toronto General Hospital Research Institute, University Health Network, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
- Tony Tadic
  - Radiation Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Tirth Patel
  - Radiation Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Joe Duhamel
  - Peter Munk Cardiac Centre and Ted Rogers Centre for Heart Research, University Health Network, Toronto, ON, Canada
- Sonja Kandel
  - Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Yasbanoo Moayedi
  - Peter Munk Cardiac Centre and Ted Rogers Centre for Heart Research, University Health Network, Toronto, ON, Canada
- Michael Brudno
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Andrew Hope
  - Radiation Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Heather Ross
  - Peter Munk Cardiac Centre and Ted Rogers Centre for Heart Research, University Health Network, Toronto, ON, Canada
- Chris McIntosh
  - Peter Munk Cardiac Centre and Ted Rogers Centre for Heart Research, University Health Network, Toronto, ON, Canada.
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.
  - Toronto General Hospital Research Institute, University Health Network, Toronto, ON, Canada.
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada.
  - Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada.
  - Vector Institute, Toronto, ON, Canada.
  - Radiation Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada.
  - Department of Medical Imaging, University of Toronto, Toronto, ON, Canada.
3
Gan D, Zhu Y, Lu X, Li J. SCIPAC: quantitative estimation of cell-phenotype associations. Genome Biol 2024; 25:119. [PMID: 38741183 PMCID: PMC11089691 DOI: 10.1186/s13059-024-03263-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 04/30/2024] [Indexed: 05/16/2024] Open
Abstract
Numerous algorithms have been proposed to identify cell types in single-cell RNA sequencing data, yet a fundamental problem remains: determining associations between cells and phenotypes such as cancer. We develop SCIPAC, the first algorithm that quantitatively estimates the association between each cell in single-cell data and a phenotype. SCIPAC also provides a p-value for each association and applies to data with virtually any type of phenotype. We demonstrate SCIPAC's accuracy in simulated data. On four real cancerous or noncancerous datasets, insights from SCIPAC help interpret the data and generate new hypotheses. SCIPAC requires minimal tuning and is computationally very fast.
Affiliation(s)
- Dailin Gan
  - Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, 46556, IN, USA
- Yini Zhu
  - Department of Biological Sciences, Boler-Parseghian Center for Rare and Neglected Diseases, Harper Cancer Research Institute, Integrated Biomedical Sciences Graduate Program, University of Notre Dame, Notre Dame, 46556, IN, USA
- Xin Lu
  - Department of Biological Sciences, Boler-Parseghian Center for Rare and Neglected Diseases, Harper Cancer Research Institute, Integrated Biomedical Sciences Graduate Program, University of Notre Dame, Notre Dame, 46556, IN, USA
  - Tumor Microenvironment and Metastasis Program, Indiana University Melvin and Bren Simon Comprehensive Cancer Center, Indianapolis, 46202, IN, USA
- Jun Li
  - Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, 46556, IN, USA.
4
Dong S, Deng K, Huang X. Single-cell type annotation with deep learning in 265 cell types for humans. Bioinformatics Advances 2024; 4:vbae054. [PMID: 38645719 PMCID: PMC11031354 DOI: 10.1093/bioadv/vbae054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 03/17/2024] [Accepted: 04/05/2024] [Indexed: 04/23/2024]
Abstract
Motivation Annotating cell types is a challenging yet essential task in analyzing single-cell RNA sequencing data. However, due to the lack of a gold standard, it is difficult to evaluate the algorithms fairly and an overfitting algorithm may be favored in benchmarks. To address this challenge, we developed a deep learning-based single-cell type prediction tool that assigns the cell type to 265 different cell types for humans, based on data from approximately five million cells. Results We achieved a median area under the ROC curve (AUC) of 0.93 when evaluated across datasets. We found that inconsistent labeling in the existing database generated by different labs contributed to the mistakes of the model. Therefore, we used cell ontology to correct the annotations and retrained the model, which resulted in 0.971 median AUC. Our study reveals a limiting factor of the accuracy one may achieve with the current database annotation and points to the solutions towards an algorithm-based correction of the gold standard for future automated cell annotation approaches. Availability and implementation The code is available at: https://github.com/SherrySDong/Hierarchical-Correction-Improves-Automated-Single-cell-Type-Annotation. Data used in this study are listed in Supplementary Table S1 and are retrievable at the CZI database.
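As a rough illustration of the summary metric quoted above, the sketch below computes an AUC per evaluation set and then takes the median. It assumes one binary label vector and one score vector per set; the abstract does not say exactly how the authors grouped their evaluations, so the grouping, the function names and the toy numbers are all hypothetical.

```python
from statistics import median


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def median_auc(datasets):
    """Median of per-dataset AUCs; `datasets` is a list of (labels, scores)."""
    return median(auc(y, s) for y, s in datasets)


# Toy evaluation sets (made-up numbers, not from the paper).
toy = [
    ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]),  # perfectly separated
    ([1, 0, 0], [0.4, 0.6, 0.2]),          # one of two pairs correct
    ([1, 1, 0], [0.7, 0.5, 0.5]),          # one win, one tie
]
result = median_auc(toy)
```

The median is less affected than the mean by a few evaluation sets with very poor or very good separation, which is presumably why it is the reported statistic.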
Affiliation(s)
- Sherry Dong
  - Skyline High School, Ann Arbor, MI 48103, United States
  - National AI Campus and Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA 90069, United States
- Kaiwen Deng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States
- Xiuzhen Huang
  - National AI Campus and Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA 90069, United States
5
Zhang Y, Sun H, Zhang W, Fu T, Huang S, Mou M, Zhang J, Gao J, Ge Y, Yang Q, Zhu F. CellSTAR: a comprehensive resource for single-cell transcriptomic annotation. Nucleic Acids Res 2024; 52:D859-D870. [PMID: 37855686 PMCID: PMC10767908 DOI: 10.1093/nar/gkad874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 09/12/2023] [Accepted: 09/27/2023] [Indexed: 10/20/2023] Open
Abstract
Large-scale studies of single-cell sequencing and biological experiments have successfully revealed expression patterns that distinguish different cell types in tissues, emphasizing the importance of studying cellular heterogeneity and accurately annotating cell types. Analysis of gene expression profiles in these experiments provides two essential types of data for cell type annotation: annotated references and canonical markers. In this study, the first comprehensive database of single-cell transcriptomic annotation resource (CellSTAR) was thus developed. It is unique in (a) offering the comprehensive expertly annotated reference data for annotating hundreds of cell types for the first time and (b) enabling the collective consideration of reference data and marker genes by incorporating tens of thousands of markers. Given its unique features, CellSTAR is expected to attract broad research interests from the technological innovations in single-cell transcriptomics, the studies of cellular heterogeneity & dynamics, and so on. It is now publicly accessible without any login requirement at: https://idrblab.org/cellstar.
Affiliation(s)
- Ying Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Huaicheng Sun
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Wei Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Tingting Fu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Shijie Huang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Jinsong Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Jianqing Gao
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Yichao Ge
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Qingxia Yang
  - Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
  - Department of Bioinformatics, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Feng Zhu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
6
Hou W, Ji Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. bioRxiv 2023:2023.04.16.537094. [PMID: 37131626 PMCID: PMC10153208 DOI: 10.1101/2023.04.16.537094] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 05/04/2023]
Abstract
Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We assessed the performance of GPT-4, a highly potent large language model, for cell type annotation, and demonstrated that it can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations and has the potential to considerably reduce the effort and expertise needed in cell type annotation. We also developed GPTCelltype, an open-source R software package to facilitate cell type annotation by GPT-4.
Affiliation(s)
- Wenpin Hou
  - Department of Biostatistics, The Mailman School of Public Health, Columbia University, New York City, NY, USA
- Zhicheng Ji
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
7
Lazaros K, Vlamos P, Vrahatis AG. Methods for cell-type annotation on scRNA-seq data: A recent overview. J Bioinform Comput Biol 2023; 21:2340002. [PMID: 37743364 DOI: 10.1142/s0219720023400024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
The evolution of single-cell technology is ongoing, continually generating massive amounts of data that reveal many mysteries surrounding intricate diseases. However, its drawbacks continue to constrain us. Among these, annotating cell types in single-cell gene expression data poses a substantial challenge, despite the myriad of tools at our disposal. The rapid growth in data, resources, and tools has consequently brought about significant alterations in this area over the years. In our study, we spotlight all noteworthy cell type annotation techniques developed over the past four years. We provide an overview of the latest trends in this field, showcasing the most advanced methods in taxonomy. Our research underscores the demand for additional tools that incorporate a biological context and also predicts that the rising trend of graph neural network approaches will likely lead this research field in the coming years.
Affiliation(s)
- Konstantinos Lazaros
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
- Panagiotis Vlamos
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
- Aristidis G Vrahatis
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
8
Xiong G, Bekiranov S, Zhang A. ProtoCell4P: an explainable prototype-based neural network for patient classification using single-cell RNA-seq. Bioinformatics 2023; 39:btad493. [PMID: 37540223 PMCID: PMC10444962 DOI: 10.1093/bioinformatics/btad493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 07/09/2023] [Accepted: 08/03/2023] [Indexed: 08/05/2023] Open
Abstract
MOTIVATION The rapid advance in single-cell RNA sequencing (scRNA-seq) technology over the past decade has provided a rich resource of gene expression profiles of single cells measured on patients, facilitating the study of many biological questions at the single-cell level. One intriguing research direction is to study the single cells that play critical roles in the phenotypes of patients, which has the potential to identify the cells and genes driving the disease phenotypes. To this end, deep learning models are expected to encode the single-cell information well and achieve precise prediction of patients' phenotypes using scRNA-seq data. However, we face critical challenges in designing deep learning models for classifying patient samples because (i) the samples collected in the same dataset contain a variable number of cells: some samples might only have hundreds of cells sequenced while others could have thousands, and (ii) the number of samples available is typically small and the expression profile of each cell is noisy and extremely high-dimensional. Moreover, the black-box nature of existing deep learning models makes it difficult for researchers to interpret the models and extract useful knowledge from them. RESULTS We propose a prototype-based and cell-informed model for patient phenotype classification, termed ProtoCell4P, that can alleviate the problems of sample scarcity and the varying number of cells by leveraging cell knowledge through representatives of cells (called prototypes), and precisely classify the patients by adaptively incorporating information from different cells. Moreover, this classification process can be explicitly interpreted by identifying the key cells for decision making and by further summarizing the knowledge of cell types to unravel the biological nature of the classification. Our approach is explainable at single-cell resolution and can identify the key cells in each patient's classification.
The experimental results demonstrate that our proposed method can effectively deal with patient classifications using single-cell data and outperforms the existing approaches. Furthermore, our approach is able to uncover the association between cell types and biological classes of interest from a data-driven perspective. AVAILABILITY AND IMPLEMENTATION https://github.com/Teddy-XiongGZ/ProtoCell4P.
Affiliation(s)
- Guangzhi Xiong
  - Department of Computer Science, University of Virginia, Charlottesville, VA, United States
- Stefan Bekiranov
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
- Aidong Zhang
  - Department of Computer Science, University of Virginia, Charlottesville, VA, United States
9
Biharie K, Michielsen L, Reinders MJT, Mahfouz A. Cell type matching across species using protein embeddings and transfer learning. Bioinformatics 2023; 39:i404-i412. [PMID: 37387141 PMCID: PMC10311290 DOI: 10.1093/bioinformatics/btad248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Knowing the relation between cell types is crucial for translating experimental results from mice to humans. Establishing cell type matches, however, is hindered by the biological differences between the species. A substantial amount of evolutionary information between genes that could be used to align the species is discarded by most of the current methods since they only use one-to-one orthologous genes. Some methods try to retain the information by explicitly including the relation between genes, however, not without caveats. RESULTS In this work, we present a model to transfer and align cell types in cross-species analysis (TACTiCS). First, TACTiCS uses a natural language processing model to match genes using their protein sequences. Next, TACTiCS employs a neural network to classify cell types within a species. Afterward, TACTiCS uses transfer learning to propagate cell type labels between species. We applied TACTiCS on scRNA-seq data of the primary motor cortex of human, mouse, and marmoset. Our model can accurately match and align cell types on these datasets. Moreover, our model outperforms Seurat and the state-of-the-art method SAMap. Finally, we show that our gene matching method results in better cell type matches than BLAST in our model. AVAILABILITY AND IMPLEMENTATION The implementation is available on GitHub (https://github.com/kbiharie/TACTiCS). The preprocessed datasets and trained models can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.7582460).
Affiliation(s)
- Kirti Biharie
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Lieke Michielsen
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Marcel J T Reinders
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Ahmed Mahfouz
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
10
Liu Y, Wei G, Li C, Shen LC, Gasser RB, Song J, Chen D, Yu DJ. TripletCell: a deep metric learning framework for accurate annotation of cell types at the single-cell level. Brief Bioinform 2023; 24:bbad132. [PMID: 37080771 PMCID: PMC10199768 DOI: 10.1093/bib/bbad132] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/21/2022] [Revised: 02/02/2023] [Accepted: 03/14/2023] [Indexed: 04/22/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has significantly accelerated the experimental characterization of distinct cell lineages and types in complex tissues and organisms. Cell-type annotation is of great importance in most of the scRNA-seq analysis pipelines. However, manual cell-type annotation heavily relies on the quality of scRNA-seq data and marker genes, and therefore can be laborious and time-consuming. Furthermore, the heterogeneity of scRNA-seq datasets poses another challenge for accurate cell-type annotation, such as the batch effect induced by different scRNA-seq protocols and samples. To overcome these limitations, here we propose a novel pipeline, termed TripletCell, for cross-species, cross-protocol and cross-sample cell-type annotation. We developed a cell embedding and dimension-reduction module for the feature extraction (FE) in TripletCell, namely TripletCell-FE, to leverage the deep metric learning-based algorithm for the relationships between the reference gene expression matrix and the query cells. Our experimental studies on 21 datasets (covering nine scRNA-seq protocols, two species and three tissues) demonstrate that TripletCell outperformed state-of-the-art approaches for cell-type annotation. More importantly, regardless of protocols or species, TripletCell can deliver outstanding and robust performance in annotating different types of cells. TripletCell is freely available at https://github.com/liuyan3056/TripletCell. We believe that TripletCell is a reliable computational tool for accurately annotating various cell types using scRNA-seq data and will be instrumental in assisting the generation of novel biological hypotheses in cell biology.
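For readers unfamiliar with deep metric learning, the sketch below shows the standard triplet margin loss on plain Python vectors. The abstract does not state the loss used in TripletCell-FE, so treating it as a triplet loss is an assumption suggested only by the tool's name; the embeddings, the `margin` value and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed, not confirmed by the abstract).

    The loss is zero once the anchor is closer to the positive (same cell
    type) than to the negative (different cell type) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings of three cells.
anchor = [0.0, 0.0]
same_type = [0.0, 1.0]       # distance 1 from the anchor
far_other = [3.0, 4.0]       # distance 5 from the anchor
near_other = [1.0, 0.0]      # distance 1 from the anchor

easy = triplet_loss(anchor, same_type, far_other)    # already well separated
hard = triplet_loss(anchor, same_type, near_other)   # violates the margin
```

Minimising this quantity over many triplets pulls cells of the same type together in the embedding space and pushes different types apart, which is what makes nearest-neighbour style annotation of query cells feasible.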
Affiliation(s)
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Guo Wei
  - School of Life Sciences, Nanjing University, Nanjing 210023, China
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Long-Chen Shen
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Robin B Gasser
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Dijun Chen
  - School of Life Sciences, Nanjing University, Nanjing 210023, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
11
Hou W, Ji Z. Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis. Research Square 2023:rs.3.rs-2824971. [PMID: 37205379 PMCID: PMC10187429 DOI: 10.21203/rs.3.rs-2824971/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We demonstrate that GPT-4, a highly potent large language model, can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations, and has the potential to considerably reduce the effort and expertise needed in cell type annotation.
Affiliation(s)
- Wenpin Hou
  - Department of Biostatistics, The Mailman School of Public Health, Columbia University, New York City, NY, USA
- Zhicheng Ji
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
12
Pei G, Yan F, Simon LM, Dai Y, Jia P, Zhao Z. deCS: A Tool for Systematic Cell Type Annotations of Single-cell RNA Sequencing Data among Human Tissues. Genomics, Proteomics & Bioinformatics 2023; 21:370-384. [PMID: 35470070 PMCID: PMC10626171 DOI: 10.1016/j.gpb.2022.04.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 09/19/2021] [Revised: 03/25/2022] [Accepted: 04/07/2022] [Indexed: 02/02/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is revolutionizing the study of complex and dynamic cellular mechanisms. However, cell type annotation remains a main challenge as it largely relies on a priori knowledge and manual curation, which is cumbersome and subjective. The increasing number of scRNA-seq datasets, as well as numerous published genetic studies, has motivated us to build a comprehensive human cell type reference atlas. Here, we present decoding Cell type Specificity (deCS), an automatic cell type annotation method augmented by a comprehensive collection of human cell type expression profiles and marker genes. We used deCS to annotate scRNA-seq data from various tissue types and systematically evaluated the annotation accuracy under different conditions, including reference panels, sequencing depth, and feature selection strategies. Our results demonstrate that expanding the references is critical for improving annotation accuracy. Compared to many existing state-of-the-art annotation tools, deCS significantly reduced computation time and increased accuracy. deCS can be integrated into the standard scRNA-seq analytical pipeline to enhance cell type annotation. Finally, we demonstrated the broad utility of deCS to identify trait-cell type associations in 51 human complex traits, providing deep insights into the cellular mechanisms underlying disease pathogenesis. All documents for deCS, including source code, user manual, demo data, and tutorials, are freely available at https://github.com/bsml320/deCS.
Affiliation(s)
- Guangsheng Pei
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Fangfang Yan
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Lukas M Simon
- Therapeutic Innovation Center, Baylor College of Medicine, Houston, TX 77030, USA
- Yulin Dai
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Peilin Jia
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA.
|
13
|
Sun Y, Qiu P. Domain adaptation for supervised integration of scRNA-seq data. Commun Biol 2023; 6:274. [PMID: 36928806 PMCID: PMC10020569 DOI: 10.1038/s42003-023-04668-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/04/2022] [Accepted: 03/06/2023] [Indexed: 03/18/2023] Open
Abstract
Large-scale scRNA-seq studies typically generate data in batches, which often induce nontrivial batch effects that need to be corrected. Given the global efforts for building cell atlases and the increasing number of annotated scRNA-seq datasets accumulated, we propose a supervised strategy for scRNA-seq data integration called SIDA (Supervised Integration using Domain Adaptation), which uses the cell type annotations to guide the integration of diverse batches. The supervised strategy is based on domain adaptation that was initially proposed in the computer vision field. We demonstrate that SIDA is able to generate comprehensive reference datasets that lead to improved accuracy in automated cell type mapping analyses.
Affiliation(s)
- Yutong Sun
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
- Peng Qiu
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia, USA.
|
14
|
Ratnasiri K, Wilk AJ, Lee MJ, Khatri P, Blish CA. Single-cell RNA-seq methods to interrogate virus-host interactions. Semin Immunopathol 2023; 45:71-89. [PMID: 36414692 PMCID: PMC9684776 DOI: 10.1007/s00281-022-00972-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 08/16/2022] [Accepted: 10/31/2022] [Indexed: 11/23/2022]
Abstract
The twenty-first century has seen the emergence of many epidemic and pandemic viruses, with the most recent being the SARS-CoV-2-driven COVID-19 pandemic. As obligate intracellular parasites, viruses rely on host cells to replicate and produce progeny, resulting in complex virus and host dynamics during an infection. Single-cell RNA sequencing (scRNA-seq), by enabling broad and simultaneous profiling of both host and virus transcripts, represents a powerful technology to unravel the delicate balance between host and virus. In this review, we summarize technological and methodological advances in scRNA-seq and their applications to antiviral immunity. We highlight key scRNA-seq applications that have enabled the understanding of viral genomic and host response heterogeneity, differential responses of infected versus bystander cells, and intercellular communication networks. We expect further development of scRNA-seq technologies and analytical methods, combined with measurements of additional multi-omic modalities and increased availability of publicly accessible scRNA-seq datasets, to enable a better understanding of viral pathogenesis and enhance the development of antiviral therapeutic strategies.
Affiliation(s)
- Kalani Ratnasiri
- Stanford Immunology Program, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Department of Medicine, Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Aaron J Wilk
- Stanford Immunology Program, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Department of Medicine, Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Madeline J Lee
- Stanford Immunology Program, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Department of Medicine, Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Purvesh Khatri
- Department of Medicine, Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Institute for Immunity, Transplantation and Infection, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Department of Medicine, Center for Biomedical Informatics Research, Stanford, CA, USA.
- Inflammatix, Inc., Sunnyvale, CA, 94085, USA.
- Catherine A Blish
- Stanford Immunology Program, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Department of Medicine, Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Institute for Immunity, Transplantation and Infection, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA.
|
15
|
Christensen E, Luo P, Turinsky A, Husić M, Mahalanabis A, Naidas A, Diaz-Mejia JJ, Brudno M, Pugh T, Ramani A, Shooshtari P. Evaluation of single-cell RNAseq labelling algorithms using cancer datasets. Brief Bioinform 2022; 24:6965910. [PMID: 36585784 PMCID: PMC9851326 DOI: 10.1093/bib/bbac561] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/02/2022] [Revised: 09/19/2022] [Accepted: 11/01/2022] [Indexed: 01/01/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) clustering and labelling methods are used to determine precise cellular composition of tissue samples. Automated labelling methods rely on either unsupervised, cluster-based approaches or supervised, cell-based approaches to identify cell types. The high complexity of cancer poses a unique challenge, as tumor microenvironments are often composed of diverse cell subpopulations with unique functional effects that may lead to disease progression, metastasis and treatment resistance. Here, we assess 17 cell-based and 9 cluster-based scRNA-seq labelling algorithms using 8 cancer datasets, providing a comprehensive large-scale assessment of such methods in a cancer-specific context. Using several performance metrics, we show that cell-based methods generally achieved higher performance and were faster compared to cluster-based methods. Cluster-based methods more successfully labelled non-malignant cell types, likely because of a lack of gene signatures for relevant malignant cell subpopulations. Larger cell numbers present in some cell types in training data positively impacted prediction scores for cell-based methods. Finally, we examined which methods performed favorably when trained and tested on separate patient cohorts in scenarios similar to clinical applications, and which were able to accurately label particularly small or under-represented cell populations in the given datasets. We conclude that scPred and SVM show the best overall performances with cancer-specific data and provide further suggestions for algorithm selection. Our analysis pipeline for assessing the performance of cell type labelling algorithms is available in https://github.com/shooshtarilab/scRNAseq-Automated-Cell-Type-Labelling.
Affiliation(s)
- Andrei Turinsky
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Mia Husić
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Alaina Mahalanabis
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Alaine Naidas
- Children’s Health Research Institute, Lawson Research Institute, London, ON, Canada
- Department of Pathology and Lab Medicine, University of Western Ontario, London, ON, Canada
- Michael Brudno
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Trevor Pugh
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Ontario Institute for Cancer Research, Toronto, ON, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Arun Ramani
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Parisa Shooshtari
- Corresponding author: Parisa Shooshtari, Department of Pathology and Lab Medicine, University of Western Ontario, London, ON, Canada. Tel.: +1 (519) 685-8500 x55427. E-mail:
|
16
|
Guo H, Yang Z, Jiang T, Liu S, Wang Y, Cui Z. Evaluation of classification in single cell atac-seq data with machine learning methods. BMC Bioinformatics 2022; 23:249. [PMID: 36131234 PMCID: PMC9494763 DOI: 10.1186/s12859-022-04774-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 06/08/2022] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Technological advances in the single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) have made it relatively easy and economical to profile thousands of single cells, and they are rapidly advancing the understanding of the cellular composition of complex organisms and tissues. Because the data structure and features of scATAC-seq resemble those of scRNA-seq, it is natural to identify and classify cell types in scATAC-seq with traditional supervised machine learning methods, which have proved reliable on scRNA-seq datasets. RESULTS In this study, we evaluated the classification performance of 6 well-known machine learning methods on scATAC-seq. A total of 4 public scATAC-seq datasets that vary in tissue, size and technology were used to evaluate the performance of the methods. We assessed these methods using a 5-fold cross-validation experiment, called the intra-dataset experiment, based on recall, precision and the percentage of correctly predicted cells. The results show that these methods performed well for some specific cell types in specific scATAC-seq datasets, while the overall performance is not as good as that in scRNA-seq analysis. In addition, we evaluated the classification performance of these methods by training and predicting on different datasets generated from the same sample, called the inter-dataset experiment, which may help us to assess the performance of these methods in more realistic scenarios. CONCLUSIONS In both the intra-dataset and the inter-dataset experiments, SVM and NMC overall outperformed the other methods across all 4 datasets. Thus, we recommend that researchers use SVM and NMC as the underlying classifier when developing an automatic cell-type classification method for scATAC-seq.
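The intra-dataset protocol described in this abstract can be illustrated with a small sketch. It assumes that NMC denotes a nearest-mean (nearest-centroid) classifier, which the abstract does not spell out, and it uses a hypothetical toy matrix of 10 cells and 2 features in place of real scATAC-seq peaks. The function names and the fold assignment (every k-th cell) are illustrative choices, not the authors' pipeline.

```python
from collections import defaultdict


def fit_nearest_mean(X, y):
    """Return one mean feature vector (centroid) per class label."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict_nearest_mean(centroids, row):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def cross_validate(X, y, k=5):
    """Hold out every k-th cell in turn; return the fraction predicted correctly."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        centroids = fit_nearest_mean([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict_nearest_mean(centroids, X[i]) == y[i] for i in test)
    return correct / len(X)


# Hypothetical toy data: two well-separated cell types, labels are placeholders.
X = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.3],
     [4.8, 5.0], [0.3, 0.2], [5.1, 5.3], [0.0, 0.2], [4.9, 5.2]]
y = ["T", "B", "T", "B", "T", "B", "T", "B", "T", "B"]

accuracy = cross_validate(X, y, k=5)
print(f"fraction of correctly predicted cells: {accuracy:.2f}")
```

Real benchmarks would usually shuffle and stratify the folds and report per-class precision and recall as well; the sketch keeps only the percentage of correctly predicted cells to stay short.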
Affiliation(s)
- Hongzhe Guo
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Zhongbo Yang
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Tao Jiang
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Shiqi Liu
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Yadong Wang
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China.
- Zhe Cui
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China.
|
17
|
Madadi Y, Sun J, Chen H, Williams R, Yousefi S. Detecting retinal neural and stromal cell classes and ganglion cell subtypes based on transcriptome data with deep transfer learning. Bioinformatics 2022; 38:4321-4329. [PMID: 35876552 PMCID: PMC9991888 DOI: 10.1093/bioinformatics/btac514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Revised: 07/11/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION To develop and assess the accuracy of deep learning models that identify different retinal cell types, as well as different retinal ganglion cell (RGC) subtypes, based on patterns of single-cell RNA sequencing (scRNA-seq) in multiple datasets. RESULTS Deep domain adaptation models were developed and tested using three different datasets. The first dataset included 44 808 single retinal cells from mice (39 cell types) with 24 658 genes, the second dataset included 6225 single RGCs from mice (41 subtypes) with 13 616 genes and the third dataset included 35 699 single RGCs from mice (45 subtypes) with 18 222 genes. We used four loss functions in the learning process to align the source and target distributions, reduce misclassification errors and maximize robustness. Models were evaluated based on classification accuracy and confusion matrix. The accuracy of the model for correctly classifying 39 different retinal cell types in the first dataset was ∼92%. Accuracy in the second and third datasets reached ∼97% and 97% in correctly classifying 40 and 45 different RGC subtypes, respectively. Across a range of seven different batches in the first dataset, the accuracy of the lead model ranged from 74% to nearly 100%. The lead model provided high accuracy in identifying retinal cell types and RGC subtypes based on scRNA-seq data. The performance was reasonable based on data from different batches as well. The validated model could be readily applied to scRNA-seq data to identify different retinal cell types and subtypes. AVAILABILITY AND IMPLEMENTATION The code and datasets are available at https://github.com/DM2LL/Detecting-Retinal-Cell-Classes-and-Ganglion-Cell-Subtypes. We have also added the class labels of all samples to the datasets. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yeganeh Madadi
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- University of Tehran, Tehran, Iran
- Jian Sun
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- Hao Chen
- Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert Williams
- Department of Genetics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Siamak Yousefi
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- Department of Genetics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
|
18
|
Johnson TS, Yu CY, Huang Z, Xu S, Wang T, Dong C, Shao W, Zaid MA, Huang X, Wang Y, Bartlett C, Zhang Y, Walker BA, Liu Y, Huang K, Zhang J. Diagnostic Evidence GAuge of Single cells (DEGAS): a flexible deep transfer learning framework for prioritizing cells in relation to disease. Genome Med 2022; 14:11. [PMID: 35105355 PMCID: PMC8808996 DOI: 10.1186/s13073-022-01012-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 04/29/2021] [Accepted: 01/07/2022] [Indexed: 12/13/2022] Open
Abstract
We propose DEGAS (Diagnostic Evidence GAuge of Single cells), a novel deep transfer learning framework, to transfer disease information from patients to cells. We call such transferrable information "impressions," which allow individual cells to be associated with disease attributes like diagnosis, prognosis, and response to therapy. Using simulated data and ten diverse single-cell and patient bulk tissue transcriptomic datasets from glioblastoma multiforme (GBM), Alzheimer's disease (AD), and multiple myeloma (MM), we demonstrate the feasibility, flexibility, and broad applications of the DEGAS framework. DEGAS analysis on myeloma single-cell transcriptomics identified PHF19high myeloma cells associated with progression. Availability: https://github.com/tsteelejohnson91/DEGAS .
Affiliation(s)
- Travis S Johnson
- Department of Medicine, Indiana University School of Medicine, 535 Barnhill Dr, Indianapolis, IN, 46202, USA
- Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH, 43210, USA
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, 410 W 10th St, Suite 3000, Indianapolis, IN, 46202, USA
- Christina Y Yu
- Department of Medicine, Indiana University School of Medicine, 535 Barnhill Dr, Indianapolis, IN, 46202, USA
- Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH, 43210, USA
- Zhi Huang
- School of Electrical and Computer Engineering, Purdue University, 465 Northwestern Ave, West Lafayette, IN, 47907, USA
- Siwen Xu
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 W. 10th St, Suite 5000, Indianapolis, IN, 46202, USA
- Tongxin Wang
- Department of Computer Science, Indiana University, 150 S Woodlawn Ave, Bloomington, IN, 47405, USA
- Chuanpeng Dong
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 W. 10th St, Suite 5000, Indianapolis, IN, 46202, USA
- Wei Shao
- Department of Medicine, Indiana University School of Medicine, 535 Barnhill Dr, Indianapolis, IN, 46202, USA
- Mohammad Abu Zaid
- Department of Medicine, Indiana University School of Medicine, 535 Barnhill Dr, Indianapolis, IN, 46202, USA
- Xiaoqing Huang
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, 410 W 10th St, Suite 3000, Indianapolis, IN, 46202, USA
- Yijie Wang
- Department of Computer Science, Indiana University, 150 S Woodlawn Ave, Bloomington, IN, 47405, USA
- Christopher Bartlett
- Battelle Center for Mathematical Medicine, Nationwide Children's Hospital, 575 Children's Crossroad, Columbus, OH, 43215, USA
- Yan Zhang
- Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH, 43210, USA
- The Ohio State University Comprehensive Cancer Center (OSUCCC - James), Starling-Loving Hall, 320 W 10th Ave, Columbus, OH, 43210, USA
- Brian A Walker
- Division of Hematology Oncology, Indiana University Melvin and Bren Simon Comprehensive Cancer Center, 535 Barnhill Dr, Indianapolis, IN, 46202, USA
- Yunlong Liu
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 W. 10th St, Suite 5000, Indianapolis, IN, 46202, USA
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, 410 W 10th St, Suite 4000, Indianapolis, IN, 46202, USA
- Kun Huang
- Department of Medicine, Indiana University School of Medicine, 535 Barnhill Dr, Indianapolis, IN, 46202, USA.
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, 410 W 10th St, Suite 3000, Indianapolis, IN, 46202, USA.
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, 410 W 10th St, Suite 4000, Indianapolis, IN, 46202, USA.
- Regenstrief Institute, 1101 W 10th St, Indianapolis, IN, 46202, USA.
- Jie Zhang
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, 410 W 10th St, Suite 4000, Indianapolis, IN, 46202, USA.
|
19
|
Zeng Y, Wei Z, Pan Z, Lu Y, Yang Y. A robust and scalable graph neural network for accurate single-cell classification. Brief Bioinform 2022; 23:6501353. [PMID: 35018408 DOI: 10.1093/bib/bbab570] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 10/28/2021] [Revised: 12/01/2021] [Accepted: 12/11/2021] [Indexed: 12/25/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) techniques provide high-resolution data on cellular heterogeneity in diverse tissues, and a critical step in the data analysis is cell type identification. Traditional methods usually cluster the cells and manually identify cell clusters through marker genes, which is time-consuming and subjective. With the launch of several large-scale single-cell projects, millions of sequenced cells have been annotated, and it is promising to transfer labels from the annotated datasets to newly generated datasets. One powerful way to transfer labels is to learn cell relations through a graph neural network (GNN), but traditional GNNs struggle to process millions of cells because of the expensive message-passing procedure at each training epoch. Here, we have developed a robust and scalable GNN-based method for accurate single-cell classification (GraphCS), where the graph is constructed to connect similar cells within and between labelled and unlabelled scRNA-seq datasets for propagation of shared information. To overcome the slow information propagation of GNNs at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, enabling sublinear complexity over cell numbers. Compared with existing methods, GraphCS demonstrates better performance on simulated, cross-platform, cross-species and cross-omics scRNA-seq datasets. More importantly, our model provides high speed and scalability on large datasets, and can achieve superior performance for 1 million cells within 50 min.
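The idea of pre-calculating diffused information, rather than passing messages at every training epoch, can be sketched on a toy graph. The example below computes the exact Generalized PageRank sum, Z = sum over k of w_k * P^k * X, with dense matrices. It is not the approximate, sublinear algorithm the abstract refers to, and the graph, the features and the weights w_k = alpha * (1 - alpha)^k are illustrative assumptions rather than values from the paper.

```python
def row_normalize(adj):
    """Turn an adjacency matrix into a random-walk transition matrix P.

    Assumes every cell has at least one neighbour (self-loops count).
    """
    return [[v / sum(row) for v in row] for row in adj]


def matmul(a, b):
    """Dense matrix product for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def diffuse(adj, features, weights):
    """Return sum_k weights[k] * P^k @ features, computed once before training."""
    p = row_normalize(adj)
    hop = features                                   # P^0 @ features
    out = [[weights[0] * v for v in row] for row in hop]
    for w in weights[1:]:
        hop = matmul(p, hop)                         # advance one propagation step
        out = [[o + w * h for o, h in zip(orow, hrow)]
               for orow, hrow in zip(out, hop)]
    return out


# Hypothetical chain of three cells (0 - 1 - 2) with self-loops.
adj = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
features = [[1.0], [0.0], [0.0]]                     # signal starts on cell 0 only
alpha = 0.5
weights = [alpha * (1 - alpha) ** k for k in range(3)]

z = diffuse(adj, features, weights)
for cell, row in enumerate(z):
    print(cell, round(row[0], 4))
```

Because Z depends only on the graph and the input features, a classifier can then be trained on the rows of Z in ordinary mini-batches, which is the property that makes the precomputation attractive for large cell numbers.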
Affiliation(s)
- Yuansong Zeng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zhuoyi Wei
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zixiang Pan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yutong Lu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China; Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China
|
20
|
Cui Z, Cui Y, Gao Y, Jiang T, Zang T, Wang Y. Enhancement and Imputation of Peak Signal Enables Accurate Cell-Type Classification in scATAC-seq. Front Genet 2021; 12:658352. [PMID: 33889181 PMCID: PMC8056015 DOI: 10.3389/fgene.2021.658352] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/25/2021] [Accepted: 02/22/2021] [Indexed: 11/16/2022] Open
Abstract
Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) has been widely used in profiling genome-wide chromatin accessibility in thousands of individual cells. However, compared with single-cell RNA-seq, the peaks of scATAC-seq are much sparser due to the lower copy numbers (diploid in humans) and the inherent missing signals, which makes it more challenging to classify cell types based on specifically expressed genes or other canonical markers. Here, we present svmATAC, a support vector machine (SVM)-based method for accurately identifying cell types in scATAC-seq datasets by enhancing peak signal strength and imputing signals through patterns of co-accessibility. We applied svmATAC to several scATAC-seq datasets from human immune cells, human hematopoietic system cells, and peripheral blood mononuclear cells. The benchmark results showed that svmATAC is free of literature-based markers and robust across datasets in different libraries and platforms. The source code of svmATAC is available at https://github.com/mrcuizhe/svmATAC under the MIT license.
Affiliation(s)
- Zhe Cui
- Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Ya Cui
- College of Life Science, University of Chinese Academy of Sciences, Beijing, China
- Yan Gao
- Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Tao Jiang
- Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Tianyi Zang
- Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
- Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|
21
|
Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol 2021; 22:69. [PMID: 33618746 PMCID: PMC7898451 DOI: 10.1186/s13059-021-02281-7] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 09/10/2020] [Accepted: 01/27/2021] [Indexed: 12/13/2022] Open
Abstract
On single-cell RNA-sequencing data, we consider the problem of assigning cells to known cell types, assuming that the identities of cell-type-specific marker genes are given but their exact expression levels are unavailable, that is, without using a reference dataset. Based on an observation that the expected over-expression of marker genes is often absent in a nonnegligible proportion of cells, we develop a method called scSorter. scSorter allows marker genes to express at a low level and borrows information from the expression of non-marker genes. On both simulated and real data, scSorter shows much higher power compared to existing methods.
Affiliation(s)
- Hongyu Guo
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 102 Crowley Hall, Notre Dame, USA
- Jun Li
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 102 Crowley Hall, Notre Dame, USA.
|
22
|
Pasquini G, Rojo Arias JE, Schäfer P, Busskamp V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 2021; 19:961-969. [PMID: 33613863 PMCID: PMC7873570 DOI: 10.1016/j.csbj.2021.01.015] [Citation(s) in RCA: 88] [Impact Index Per Article: 29.3] [Received: 10/30/2020] [Revised: 01/13/2021] [Accepted: 01/13/2021] [Indexed: 12/22/2022] Open
Abstract
The advent of single-cell sequencing started a new era of transcriptomic and genomic research, advancing our knowledge of the cellular heterogeneity and dynamics. Cell type annotation is a crucial step in analyzing single-cell RNA sequencing data, yet manual annotation is time-consuming and partially subjective. As an alternative, tools have been developed for automatic cell type identification. Different strategies have emerged to ultimately associate gene expression profiles of single cells with a cell type either by using curated marker gene databases, correlating reference expression data, or transferring labels by supervised classification. In this review, we present an overview of the available tools and the underlying approaches to perform automated cell type annotations on scRNA-seq data.
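Two of the three strategies named in this abstract, marker-gene lookup and correlation against reference expression profiles, can be sketched in a few lines. The expression values, the two-type marker table and the reference profiles below are made-up illustrations, and the function names are hypothetical; the tools surveyed in the review use considerably more elaborate scoring and statistics.

```python
def annotate_by_markers(cell, markers):
    """Pick the cell type whose marker genes have the highest mean expression."""
    def score(genes):
        return sum(cell.get(g, 0.0) for g in genes) / len(genes)
    return max(markers, key=lambda cell_type: score(markers[cell_type]))


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def annotate_by_correlation(cell, reference):
    """Pick the reference profile most correlated with the cell's expression."""
    genes = sorted(cell)
    profile = [cell[g] for g in genes]
    return max(
        reference,
        key=lambda ct: pearson(profile, [reference[ct].get(g, 0.0) for g in genes]),
    )


# Hypothetical log-expression values for one cell (illustrative numbers only).
cell = {"CD3E": 4.0, "CD2": 3.0, "CD19": 0.0, "MS4A1": 0.5}

# Small illustrative marker table and reference profiles for two cell types.
markers = {"T cell": ["CD3E", "CD2"], "B cell": ["CD19", "MS4A1"]}
reference = {
    "T cell": {"CD3E": 5.0, "CD2": 4.0, "CD19": 0.0, "MS4A1": 0.0},
    "B cell": {"CD3E": 0.0, "CD2": 0.0, "CD19": 5.0, "MS4A1": 6.0},
}

marker_label = annotate_by_markers(cell, markers)
correlation_label = annotate_by_correlation(cell, reference)
print(marker_label, correlation_label)
```

The third strategy, label transfer by supervised classification, replaces the fixed marker table or reference profile with a model trained on annotated cells.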
Affiliation(s)
- Giovanni Pasquini
- Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Universitäts-Augenklinik Bonn, University of Bonn, Department of Ophthalmology, Bonn 53127, Germany
- Jesus Eduardo Rojo Arias
- Wellcome-MRC Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, UK
- Patrick Schäfer
- Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Volker Busskamp
- Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Universitäts-Augenklinik Bonn, University of Bonn, Department of Ophthalmology, Bonn 53127, Germany
|
23
|
Forcato M, Romano O, Bicciato S. Computational methods for the integrative analysis of single-cell data. Brief Bioinform 2021; 22:20-29. [PMID: 32363378 PMCID: PMC7820847 DOI: 10.1093/bib/bbaa042] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 12/12/2019] [Revised: 03/05/2020] [Accepted: 01/03/2020] [Indexed: 01/05/2023] Open
Abstract
Recent advances in single-cell technologies are providing exciting opportunities for dissecting tissue heterogeneity and investigating cell identity, fate and function. This is a pristine, exploding field that is flooding biologists with a new wave of data, each with its own specificities in terms of complexity and information content. The integrative analysis of genomic data, collected at different molecular layers from diverse cell populations, holds promise to address the full-scale complexity of biological systems. However, the combination of different single-cell genomic signals is computationally challenging, as these data are intrinsically heterogeneous for experimental, technical and biological reasons. Here, we describe the computational methods for the integrative analysis of single-cell genomic data, with a focus on the integration of single-cell RNA sequencing datasets and on the joint analysis of multimodal signals from individual cells.
Affiliation(s)
- Mattia Forcato
- Molecular Biology and Bioinformatics at the University of Modena and Reggio Emilia. His research interests include the development and application of bioinformatics methods for the analysis of next-generation sequencing data
- Oriana Romano
- Molecular Biology and Bioinformatics at the University of Modena and Reggio Emilia. Her research activities are mainly focused on the integrative analysis of transcriptional and epigenomic bulk and single-cell data
| | - Silvio Bicciato
- Industrial Bioengineering at the University of Modena and Reggio Emilia. His research activity is the development and application of computational approaches for the analysis of multi -omics data
| |
24
Johnson TS, Xiang S, Helm BR, Abrams ZB, Neidecker P, Machiraju R, Zhang Y, Huang K, Zhang J. Spatial cell type composition in normal and Alzheimers human brains is revealed using integrated mouse and human single cell RNA sequencing. Sci Rep 2020; 10:18014. [PMID: 33093481 PMCID: PMC7582925 DOI: 10.1038/s41598-020-74917-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/10/2019] [Accepted: 09/16/2020] [Indexed: 12/20/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) resolves heterogenous cell populations in tissues and helps to reveal single-cell level function and dynamics. In neuroscience, the rarity of brain tissue is the bottleneck for such study. Evidence shows that, mouse and human share similar cell type gene markers. We hypothesized that the scRNA-seq data of mouse brain tissue can be used to complete human data to infer cell type composition in human samples. Here, we supplement cell type information of human scRNA-seq data, with mouse. The resulted data were used to infer the spatial cellular composition of 3702 human brain samples from Allen Human Brain Atlas. We then mapped the cell types back to corresponding brain regions. Most cell types were localized to the correct regions. We also compare the mapping results to those derived from neuronal nuclei locations. They were consistent after accounting for changes in neural connectivity between regions. Furthermore, we applied this approach on Alzheimer's brain data and successfully captured cell pattern changes in AD brains. We believe this integrative approach can solve the sample rarity issue in the neuroscience.
Affiliation(s)
- Travis S Johnson
  - Department of Biomedical Informatics, The Ohio State University, Lincoln Tower 250, 1800 Cannon Dr., Columbus, OH, 43210, USA
  - Department of Medicine, Indiana University School of Medicine, Emerson Hall 305, 545 Barnhill Dr., Indianapolis, IN, 46202, USA
  - Department of Biostatistics, Indiana University School of Medicine, HITS 3000, 410 W. 10th St., Indianapolis, IN, 46202, USA
- Shunian Xiang
  - Department of Medicine, Indiana University School of Medicine, Emerson Hall 305, 545 Barnhill Dr., Indianapolis, IN, 46202, USA
- Bryan R Helm
  - Department of Medicine, Indiana University School of Medicine, Emerson Hall 305, 545 Barnhill Dr., Indianapolis, IN, 46202, USA
- Zachary B Abrams
  - Department of Biomedical Informatics, The Ohio State University, Lincoln Tower 250, 1800 Cannon Dr., Columbus, OH, 43210, USA
- Peter Neidecker
  - Department of Mathematics, The Ohio State University, Math Tower 100, 231 West 18th Ave., Columbus, OH, 43210, USA
- Raghu Machiraju
  - Department of Computer Science and Engineering, The Ohio State University, Dreese Laboratories 779, 2015 Neil Ave., Columbus, OH, 43210, USA
- Yan Zhang
  - Department of Biomedical Informatics, The Ohio State University, Lincoln Tower 250, 1800 Cannon Dr., Columbus, OH, 43210, USA
- Kun Huang
  - Department of Medicine, Indiana University School of Medicine, Emerson Hall 305, 545 Barnhill Dr., Indianapolis, IN, 46202, USA
  - Regenstrief Institute, 335, 1101 W. 10th St., Indianapolis, IN, 46202, USA
  - Medical and Molecular Genetics, Indiana University Purdue University Indianapolis, HITS 5015, 410 W. 10th St., Indianapolis, IN, 46202, USA
- Jie Zhang
  - Medical and Molecular Genetics, Indiana University Purdue University Indianapolis, HITS 5015, 410 W. 10th St., Indianapolis, IN, 46202, USA
25
Huang S, Huang Z, Ma C, Luo L, Li YF, Wu YL, Ren Y, Feng C. Acidic leucine-rich nuclear phosphoprotein-32A expression contributes to adverse outcome in acute myeloid leukemia. Ann Transl Med 2020; 8:345. [PMID: 32355789 PMCID: PMC7186738 DOI: 10.21037/atm.2020.02.54] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022]
Abstract
BACKGROUND Acidic leucine-rich nuclear phosphoprotein-32A (ANP32A) is a novel regulator of histone H3 acetylation and promotes leukemogenesis in acute myeloid leukemia (AML). However, its prognostic value in AML remains unclear. METHODS In this study, we evaluated the prognostic significance of ANP32A expression using two independent large cohorts of cytogenetically normal AML (CN-AML) patients. Multivariable analysis in CN-AML group was also presented. Based on the ANP32A expression, its related genes, dysregulation of pathways, interaction network analysis between microRNAs and target genes, as well as methylation analysis were performed to unveil the complex functions behind ANP32A. RESULTS Here we demonstrated overexpression of ANP32A was notably associated with unfavorable outcome in two independent cohorts of CN-AML patients (OS: P=0.012, EFS: P=0.005, n=185; OS: P=0.041, n=232), as well as in European Leukemia Net (ELN) Intermediate-I group (OS: P=0.018, EFS: P=0.045, n=115), National Comprehensive Cancer Network (NCCN) Intermediate Risk AML group (OS: P=0.048, EFS: P=0.039, n=225), and non-M3 AML group (OS: P=0.034, EFS: P=0.011, n=435). Multivariable analysis further validated ANP32A as a high-risk factor in CN-AML group. Multi-omics analysis presented overexpression of ANP32A was associated with aberrant expression of oncogenes and tumor suppressor, up/down-regulation of metabolic and immune-related pathways, dysregulation of microRNAs, and hypomethylation on CpG island and 1st Exon regions. CONCLUSIONS We proved ANP32A as a novel, potential unfavorable prognosticator and therapeutic target for AML.
Affiliation(s)
- Sai Huang
  - Department of Hematology, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
- Zhi Huang
  - School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
- Chao Ma
  - Department of Hematology, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
- Lan Luo
  - Department of Hematology, Peking University Third Hospital, Beijing 100191, China
- Yan-Fen Li
  - Department of Hematology, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
- Yong-Li Wu
  - Department of Hematology, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
- Yuan Ren
  - Department of Hematology, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
- Cong Feng
  - Department of Emergency, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
26
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini CSO, Aparicio S, Baaijens J, Balvert M, Barbanson BD, Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo TH, Lelieveldt BP, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Rączkowska A, Reinders M, Ridder JD, Saliba AE, Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in single-cell data science. Genome Biol 2020; 21:31. [PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6] [Citation(s) in RCA: 594] [Impact Index Per Article: 148.5] [Received: 08/02/2019] [Accepted: 01/02/2020] [Indexed: 02/08/2023] Open
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Affiliation(s)
- David Lähnemann
  - Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
  - Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf, Germany
  - Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Johannes Köster
  - Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
  - Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA
- Ewa Szczurek
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
- Davis J. McCarthy
  - Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia
  - Melbourne Integrative Genomics, School of BioSciences–School of Mathematics & Statistics, Faculty of Science, University of Melbourne, Melbourne, Australia
- Stephanie C. Hicks
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD USA
- Mark D. Robinson
  - Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zürich, Zürich, Switzerland
- Catalina A. Vallejos
  - MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK
  - The Alan Turing Institute, British Library, London, UK
- Kieran R. Campbell
  - Department of Statistics, University of British Columbia, Vancouver, Canada
  - Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
  - Data Science Institute, University of British Columbia, Vancouver, Canada
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Ahmed Mahfouz
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Luca Pinello
  - Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA
  - Department of Pathology, Harvard Medical School, Boston, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, USA
- Alexandros Stamatakis
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Samuel Aparicio
  - Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
  - Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
- Jasmijn Baaijens
  - Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Marleen Balvert
  - Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
- Buys de Barbanson
  - Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
  - Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
- Antonio Cappuccio
  - Institute for Advanced Study, University of Amsterdam, Amsterdam, The Netherlands
- Giacomo Corleone
  - Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, London, UK
- Bas E. Dutilh
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
  - Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands
- Maria Florescu
  - Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
  - Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
- Victor Guryev
  - European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Rens Holmer
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Katharina Jahn
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Thamar Jessurun Lobo
  - European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Emma M. Keizer
  - Biometris, Wageningen University & Research, Wageningen, The Netherlands
- Indu Khatri
  - Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, Leiden, The Netherlands
- Szymon M. Kielbasa
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Jan O. Korbel
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Alexey M. Kozlov
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Tzu-Hao Kuo
  - Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Boudewijn P.F. Lelieveldt
  - PRB lab, Delft University of Technology, Delft, The Netherlands
  - Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Ion I. Mandoiu
  - Computer Science & Engineering Department, University of Connecticut, Storrs, USA
- John C. Marioni
  - Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK
  - Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Tobias Marschall
  - Center for Bioinformatics, Saarland University, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarbrücken, Germany
- Felix Mölder
  - Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
  - Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Amir Niknejad
  - Computation molecular design, Zuse Institute Berlin, Berlin, Germany
  - Mathematics Department, Mount Saint Vincent, New York, USA
- Alicja Rączkowska
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
- Marcel Reinders
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Jeroen de Ridder
  - Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
- Antoine-Emmanuel Saliba
  - Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany
- Antonios Somarakis
  - Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Oliver Stegle
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center–DKFZ, Heidelberg, Germany
- Fabian J. Theis
  - Institute of Computational Biology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany
- Huan Yang
  - Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research–LACDR–Leiden University, Leiden, The Netherlands
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Alice C. McHardy
  - Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Sohrab P. Shah
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA
- Alexander Schönhuth
  - Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
27
Mieth B, Hockley JRF, Görnitz N, Vidovic MMC, Müller KR, Gutteridge A, Ziemek D. Using transfer learning from prior reference knowledge to improve the clustering of single-cell RNA-Seq data. Sci Rep 2019; 9:20353. [PMID: 31889137 PMCID: PMC6937257 DOI: 10.1038/s41598-019-56911-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 04/29/2019] [Accepted: 12/13/2019] [Indexed: 01/21/2023] Open
Abstract
In many research areas scientists are interested in clustering objects within small datasets while making use of prior knowledge from large reference datasets. We propose a method to apply the machine learning concept of transfer learning to unsupervised clustering problems and show its effectiveness in the field of single-cell RNA sequencing (scRNA-Seq). The goal of scRNA-Seq experiments is often the definition and cataloguing of cell types from the transcriptional output of individual cells. To improve the clustering of small disease- or tissue-specific datasets, for which the identification of rare cell types is often problematic, we propose a transfer learning method to utilize large and well-annotated reference datasets, such as those produced by the Human Cell Atlas. Our approach modifies the dataset of interest while incorporating key information from the larger reference dataset via Non-negative Matrix Factorization (NMF). The modified dataset is subsequently provided to a clustering algorithm. We empirically evaluate the benefits of our approach on simulated scRNA-Seq data as well as on publicly available datasets. Finally, we present results for the analysis of a recently published small dataset and find improved clustering when transferring knowledge from a large reference dataset. Implementations of the method are available at https://github.com/nicococo/scRNA.
Affiliation(s)
- Bettina Mieth
  - Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- James R F Hockley
  - Department of Pharmacology, University of Cambridge, Cambridge, CB2 1PD, United Kingdom
  - GlaxoSmithKline, Stevenage, SG1 2NY, United Kingdom
- Nico Görnitz
  - Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- Marina M-C Vidovic
  - Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- Klaus-Robert Müller
  - Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
  - Department of Brain and Cognitive Engineering, Korea University, Seoul, 02841, Republic of Korea
  - Max Planck Institute for Informatics, Saarbrücken, 66123, Germany
- Daniel Ziemek
  - Pfizer, Worldwide Research and Development, Berlin, 10785, Germany
28
Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, Mahfouz A. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol 2019; 20:194. [PMID: 31500660 PMCID: PMC6734286 DOI: 10.1186/s13059-019-1795-z] [Citation(s) in RCA: 315] [Impact Index Per Article: 63.0] [Received: 05/18/2019] [Accepted: 08/17/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. RESULTS Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods' sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. CONCLUSIONS We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub ( https://github.com/tabdelaal/scRNAseq_Benchmark ). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.
Affiliation(s)
- Tamim Abdelaal
  - Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
  - Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
- Lieke Michielsen
  - Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
  - Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
- Davy Cats
  - Sequencing Analysis Support Core, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Dylan Hoogduin
  - Sequencing Analysis Support Core, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Hailiang Mei
  - Sequencing Analysis Support Core, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Marcel J. T. Reinders
  - Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
  - Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
- Ahmed Mahfouz
  - Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
  - Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
29
Wang T, Johnson TS, Shao W, Lu Z, Helm BR, Zhang J, Huang K. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol 2019; 20:165. [PMID: 31405383 PMCID: PMC6691531 DOI: 10.1186/s13059-019-1764-6] [Citation(s) in RCA: 71] [Impact Index Per Article: 14.2] [Received: 05/09/2019] [Accepted: 07/17/2019] [Indexed: 12/21/2022] Open
Abstract
To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for identifying cell lineages and bona fide transcriptional signals, it is necessary to combine data from multiple experiments. We present BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.
Affiliation(s)
- Tongxin Wang
  - Department of Computer Science, Indiana University Bloomington, Bloomington, IN, USA
- Travis S Johnson
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Wei Shao
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Zixiao Lu
  - Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou, China
- Bryan R Helm
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Jie Zhang
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
- Kun Huang
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
  - Regenstrief Institute, Indianapolis, IN, USA