1
Hrovatin K, Sikkema L, Shitov VA, Heimberg G, Shulman M, Oliver AJ, Mueller MF, Ibarra IL, Wang H, Ramírez-Suástegui C, He P, Schaar AC, Teichmann SA, Theis FJ, Luecken MD. Considerations for building and using integrated single-cell atlases. Nat Methods 2025; 22:41-57. [PMID: 39672979 DOI: 10.1038/s41592-024-02532-y] [Received: 10/31/2023] [Accepted: 10/22/2024] [Indexed: 12/15/2024]
Abstract
The rapid adoption of single-cell technologies has created an opportunity to build single-cell 'atlases' integrating diverse datasets across many laboratories. Such atlases can serve as a reference for analyzing and interpreting current and future data. However, it has become apparent that atlasing approaches differ, and the impact of these differences is often unclear. Here we review the current atlasing literature and present considerations for building and using atlases. Importantly, we find that no one-size-fits-all protocol for atlas building exists, but rather we discuss context-specific considerations and workflows, including atlas conceptualization, data collection, curation and integration, atlas evaluation and atlas sharing. We further highlight the benefits of integrated atlases for analyses of new datasets and deriving biological insights beyond what is possible from individual datasets. Our overview of current practices and associated recommendations will improve the quality of atlases to come, facilitating the shift to a unified, reference-based understanding of single-cell biology.
Affiliation(s)
- Karin Hrovatin
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Lisa Sikkema
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Vladimir A Shitov
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Comprehensive Pneumology Center (CPC) with the CPC-M bioArchive / Institute of Lung Health and Immunity (LHI), Helmholtz Zentrum München; Member of the German Center for Lung Research (DZL), Munich, Germany
- Graham Heimberg
- Department of OMNI Bioinformatics, Genentech, South San Francisco, CA, USA
- Department of Biological Research | AI Development, Genentech, South San Francisco, CA, USA
- Maiia Shulman
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Amanda J Oliver
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Michaela F Mueller
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Ignacio L Ibarra
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Hanchen Wang
- Department of Biological Research | AI Development, Genentech, South San Francisco, CA, USA
- Department of Computer Science, Stanford University, Palo Alto, CA, USA
- Ciro Ramírez-Suástegui
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Peng He
- Department of Pathology, University of California, San Francisco, San Francisco, CA, USA
- Anna C Schaar
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Sarah A Teichmann
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Theory of Condensed Matter Group, Department of Physics, Cavendish Laboratory, University of Cambridge, Cambridge, UK
- Cambridge Stem Cell Institute and Department of Medicine, University of Cambridge, Cambridge, UK
- CIFAR MacMillan Multiscale Human Programme, Toronto, Ontario, Canada
- Fabian J Theis
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany.
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.
- Department of Mathematics, Technical University of Munich, Garching, Germany.
- Malte D Luecken
- Department of Computational Health, Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany.
- Comprehensive Pneumology Center (CPC) with the CPC-M bioArchive / Institute of Lung Health and Immunity (LHI), Helmholtz Zentrum München; Member of the German Center for Lung Research (DZL), Munich, Germany.
2
Lan W, Ling T, Chen Q, Zheng R, Li M, Pan Y. scMoMtF: An interpretable multitask learning framework for single-cell multi-omics data analysis. PLoS Comput Biol 2024; 20:e1012679. [PMID: 39693287 DOI: 10.1371/journal.pcbi.1012679] [Received: 05/13/2024] [Accepted: 11/26/2024] [Indexed: 12/20/2024] Open
Abstract
With the rapid development of biotechnology, it is now possible to obtain single-cell multi-omics data from the same cell. However, integrating and analyzing these single-cell multi-omics data remains a great challenge. Herein, we introduce an interpretable multitask framework (scMoMtF) for comprehensively analyzing single-cell multi-omics data. scMoMtF can simultaneously solve multiple key tasks for single-cell multi-omics data, including dimension reduction, cell classification and data simulation. The experimental results show that scMoMtF outperforms current state-of-the-art algorithms on these tasks. In addition, scMoMtF is interpretable, allowing researchers to gain a reliable understanding of potential biological features and mechanisms in single-cell multi-omics data.
Affiliation(s)
- Wei Lan
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of computer, electronic and information, Guangxi university, Nanning, Guangxi, China
- Tongsheng Ling
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of computer, electronic and information, Guangxi university, Nanning, Guangxi, China
- Qingfeng Chen
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of computer, electronic and information, Guangxi university, Nanning, Guangxi, China
- Ruiqing Zheng
- School of computer and engineering, Central South University, Changsha, Hunan, China
- Min Li
- School of computer and engineering, Central South University, Changsha, Hunan, China
- Yi Pan
- School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, Guangdong, China
3
Hu Y, Wan S, Luo Y, Li Y, Wu T, Deng W, Jiang C, Jiang S, Zhang Y, Liu N, Yang Z, Chen F, Li B, Qu K. Benchmarking algorithms for single-cell multi-omics prediction and integration. Nat Methods 2024; 21:2182-2194. [PMID: 39322753 DOI: 10.1038/s41592-024-02429-w] [Received: 04/14/2023] [Accepted: 08/19/2024] [Indexed: 09/27/2024]
Abstract
The development of single-cell multi-omics technology has greatly enhanced our understanding of biology, and in parallel, numerous algorithms have been proposed to predict the protein abundance and/or chromatin accessibility of cells from single-cell transcriptomic information and to integrate various types of single-cell multi-omics data. However, few studies have systematically compared and evaluated the performance of these algorithms. Here, we present a benchmark study of 14 protein abundance/chromatin accessibility prediction algorithms and 18 single-cell multi-omics integration algorithms using 47 single-cell multi-omics datasets. Our benchmark study showed that, overall, totalVI and scArches outperformed the other algorithms for predicting protein abundance, and LS_Lab was the top-performing algorithm for the prediction of chromatin accessibility in most cases. Seurat, MOJITOO and scAI emerge as leading algorithms for vertical integration, whereas totalVI and UINMF excel beyond their counterparts in both horizontal and mosaic integration scenarios. Additionally, we provide a pipeline to assist researchers in selecting the optimal multi-omics prediction and integration algorithm.
Affiliation(s)
- Yinlei Hu
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- School of Mathematical Science, University of Science and Technology of China, Hefei, China
- Siyuan Wan
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China
- Yuanhanyu Luo
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China
- National Institute of Biological Sciences, Beijing, China
- Yuanzhe Li
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China
- Tong Wu
- National Institute of Biological Sciences, Beijing, China
- College of Life Sciences, Beijing Normal University, Beijing, China
- Wentao Deng
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- Chen Jiang
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- Shan Jiang
- National Institute of Biological Sciences, Beijing, China
- Yueping Zhang
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China
- Nianping Liu
- School of Biomedical Engineering, Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
- Zongcheng Yang
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Falai Chen
- School of Mathematical Science, University of Science and Technology of China, Hefei, China.
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China.
- Bin Li
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China.
- National Institute of Biological Sciences, Beijing, China.
- Kun Qu
- Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China.
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China.
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China.
- School of Biomedical Engineering, Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China.
4
Zhang Z, Zhang X. Data-driven batch detection enhances single-cell omics data analysis. Cell Syst 2024; 15:893-894. [PMID: 39419000 DOI: 10.1016/j.cels.2024.09.011] [Received: 09/20/2024] [Revised: 09/23/2024] [Accepted: 09/23/2024] [Indexed: 10/19/2024]
Abstract
In single-cell omics studies, data are typically collected across multiple batches, resulting in batch effects: technical confounders that introduce noise and distort data distribution. Correcting these effects is challenging due to their unknown sources, nonlinear distortions, and the difficulty of accurately assigning data to batches that are optimal for integration methods.
Affiliation(s)
- Ziqi Zhang
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Xiuwei Zhang
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
5
He Z, Hu S, Chen Y, An S, Zhou J, Liu R, Shi J, Wang J, Dong G, Shi J, Zhao J, Ou-Yang L, Zhu Y, Bo X, Ying X. Mosaic integration and knowledge transfer of single-cell multimodal data with MIDAS. Nat Biotechnol 2024; 42:1594-1605. [PMID: 38263515 PMCID: PMC11471558 DOI: 10.1038/s41587-023-02040-y] [Received: 12/13/2022] [Accepted: 10/23/2023] [Indexed: 01/25/2024]
Abstract
Integrating single-cell datasets produced by multiple omics technologies is essential for defining cellular heterogeneity. Mosaic integration, in which different datasets share only some of the measured modalities, poses major challenges, particularly regarding modality alignment and batch effect removal. Here, we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its superiority to 19 other methods and reliability by evaluating its performance in trimodal and mosaic integration tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data. Applications in mosaic integration, pseudotime analysis and cross-tissue knowledge transfer on bone marrow mosaic datasets demonstrate the versatility and superiority of MIDAS. MIDAS is available at https://github.com/labomics/midas .
Affiliation(s)
- Zhen He
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Shuofeng Hu
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Yaowen Chen
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Sijing An
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jiahao Zhou
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Runyan Liu
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Junfeng Shi
- School of Automation, China University of Geosciences, Wuhan, China
- Jing Wang
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Guohua Dong
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jinhui Shi
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jiaxin Zhao
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Yuan Zhu
- School of Automation, China University of Geosciences, Wuhan, China
- Xiaochen Bo
- Institute of Health Service and Transfusion Medicine, Beijing, China.
- Xiaomin Ying
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China.
6
Rong Z, Song J, Yu Y, Mi L, Qiu M, Song Y, Hou Y. Single-cell mosaic integration and cell state transfer with auto-scaling self-attention mechanism. Brief Bioinform 2024; 25:bbae540. [PMID: 39438079 PMCID: PMC11495875 DOI: 10.1093/bib/bbae540] [Received: 04/02/2024] [Revised: 09/02/2024] [Accepted: 10/10/2024] [Indexed: 10/25/2024] Open
Abstract
The integration of data from multiple modalities generated by single-cell omics technologies is crucial for accurately identifying cell states. One challenge in comprehending multi-omics data resides in mosaic integration, in which different data modalities are profiled in different subsets of cells, as it requires simultaneous batch effect removal and modality alignment. Here, we develop Multi-omics Mosaic Auto-scaling Attention Variational Inference (mmAAVI), a scalable deep generative model for single-cell mosaic integration. Leveraging auto-scaling self-attention mechanisms, mmAAVI can map arbitrary combinations of omics to the common embedding space. If well-annotated cell states exist, the model can perform semisupervised learning to utilize these existing annotations. We validated the performance of mmAAVI and five other commonly used methods on four benchmark datasets, which vary in cell numbers, omics types, and missing patterns. mmAAVI consistently demonstrated its superiority. We also validated mmAAVI's ability for cell state knowledge transfer, achieving balanced accuracies of 0.82 and 0.97 with less than 1% labeled cells between batches with completely different omics. The full package is available at https://github.com/luyiyun/mmAAVI.
Affiliation(s)
- Zhiwei Rong
- Department of Biostatistics, School of Public Health, Peking University, 38 Xueyuan Rd., Haidian District, Beijing 100191, China
- Jiali Song
- Department of Biostatistics, School of Public Health, Peking University, 38 Xueyuan Rd., Haidian District, Beijing 100191, China
- Yipei Yu
- Department of Biostatistics, School of Public Health, Peking University, 38 Xueyuan Rd., Haidian District, Beijing 100191, China
- Lan Mi
- Peking University Cancer Hospital, 52 Fucheng Rd., Haidian District, Beijing 100142, China
- ManTang Qiu
- Department of Thoracic Surgery, Peking University People’s Hospital, No. 11 Xizhimen South Street, Xicheng District, Beijing 100044, China
- Yuqin Song
- Peking University Cancer Hospital, 52 Fucheng Rd., Haidian District, Beijing 100142, China
- Yan Hou
- Department of Biostatistics, School of Public Health, Peking University, 38 Xueyuan Rd., Haidian District, Beijing 100191, China
- Peking University Cancer Hospital, 52 Fucheng Rd., Haidian District, Beijing 100142, China
- Peking University Clinical Research Center, Peking University, 38 Xueyuan Rd., Haidian District, Beijing 100191, China
7
Zhao K, So HC, Lin Z. scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis. Genome Biol 2024; 25:223. [PMID: 39152499 PMCID: PMC11328435 DOI: 10.1186/s13059-024-03345-0] [Received: 07/31/2023] [Accepted: 07/23/2024] [Indexed: 08/19/2024] Open
Abstract
The rapid rise in the availability and scale of scRNA-seq data needs scalable methods for integrative analysis. Though many methods for data integration have been developed, few focus on understanding the heterogeneous effects of biological conditions across different cell populations in integrative analysis. Our proposed scalable approach, scParser, models the heterogeneous effects from biological conditions, which unveils the key mechanisms by which gene expression contributes to phenotypes. Notably, the extended scParser pinpoints biological processes in cell subpopulations that contribute to disease pathogenesis. scParser achieves favorable performance in cell clustering compared to state-of-the-art methods and has a broad and diverse applicability.
Affiliation(s)
- Kai Zhao
- Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Hon-Cheong So
- School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, Kunming Institute of Zoology and The Chinese University of Hong Kong, Hong Kong SAR, China.
- Department of Psychiatry, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- Margaret K.L. Cheung Research Centre for Management of Parkinsonism, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- Brain and Mind Institute, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- Hong Kong Branch of the Chinese Academy of Sciences Center for Excellence in Animal Evolution and Genetics, The Chinese University of Hong Kong, Hong Kong SAR, China.
- Zhixiang Lin
- Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
8
Qiao C, Huang Y. Reliable imputation of spatial transcriptomes with uncertainty estimation and spatial regularization. Patterns (New York, N.Y.) 2024; 5:101021. [PMID: 39233691 PMCID: PMC11368697 DOI: 10.1016/j.patter.2024.101021] [Received: 02/26/2024] [Revised: 05/14/2024] [Accepted: 06/11/2024] [Indexed: 09/06/2024]
Abstract
Imputation of missing features in spatial transcriptomics is urgently needed due to technological limitations. However, most existing computational methods suffer from moderate accuracy and cannot estimate the reliability of the imputation. To fill this research gap, we introduce a computational model, TransImpute, that imputes the missing feature modality in spatial transcriptomics by mapping it from single-cell reference data. We derive a set of attributes that can accurately predict imputation uncertainty, enabling us to select reliably imputed genes. In addition, we introduce a spatial autocorrelation metric as a regularization to avoid overestimating spatial patterns. Multiple datasets from various platforms demonstrate that our approach significantly improves the reliability of downstream analyses in detecting spatial variable genes and interacting ligand-receptor pairs. Therefore, TransImpute offers a reliable approach to spatial analysis of missing features for both matched and unseen modalities, such as nascent RNAs.
Affiliation(s)
- Chen Qiao
- School of Biomedical Sciences, University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Yuanhua Huang
- School of Biomedical Sciences, University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Department of Statistics and Actuarial Science, University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Center for Translational Stem Cell Biology, Hong Kong Science and Technology Park, Hong Kong SAR, China
9
Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, Wang B. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 2024; 21:1470-1480. [PMID: 38409223 DOI: 10.1038/s41592-024-02201-0] [Received: 07/12/2023] [Accepted: 01/30/2024] [Indexed: 02/28/2024]
Abstract
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
Affiliation(s)
- Haotian Cui
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Chloe Wang
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Hassaan Maan
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Kuan Pang
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Fengning Luo
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Nan Duan
- Microsoft Research, Redmond, WA, USA
- Bo Wang
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
- Vector Institute, Toronto, Ontario, Canada.
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada.
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada.
- AI Hub, University Health Network, Toronto, Ontario, Canada.
10
Verhey TB, Seo H, Gillmor A, Thoppey-Manoharan V, Schriemer D, Morrissy S. mosaicMPI: a framework for modular data integration across cohorts and -omics modalities. Nucleic Acids Res 2024; 52:e53. [PMID: 38813827 PMCID: PMC11229337 DOI: 10.1093/nar/gkae442] [Received: 09/02/2023] [Revised: 04/26/2024] [Accepted: 05/10/2024] [Indexed: 05/31/2024] Open
Abstract
Advances in molecular profiling have facilitated generation of large multi-modal datasets that can potentially reveal critical axes of biological variation underlying complex diseases. Distilling biological meaning, however, requires computational strategies that can perform mosaic integration across diverse cohorts and datatypes. Here, we present mosaicMPI, a framework for discovery of low to high-resolution molecular programs representing both cell types and states, and integration within and across datasets into a network representing biological themes. Using existing datasets in glioblastoma, we demonstrate that this approach robustly integrates single cell and bulk programs across multiple platforms. Clinical and molecular annotations from cohorts are statistically propagated onto this network of programs, yielding a richly characterized landscape of biological themes. This enables deep understanding of individual tumor samples, systematic exploration of relationships between modalities, and generation of a reference map onto which new datasets can rapidly be mapped. mosaicMPI is available at https://github.com/MorrissyLab/mosaicMPI.
Affiliation(s)
- Theodore B Verhey
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, Alberta, Canada
- Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, Canada
- Heewon Seo
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, Alberta, Canada
- Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- Aaron Gillmor
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, Alberta, Canada
- Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- Varsha Thoppey-Manoharan
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, Alberta, Canada
- Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- David Schriemer
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, Alberta, Canada
- Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- Sorana Morrissy
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, Alberta, Canada
- Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, Canada
11
Rautenstrauch P, Ohler U. Liam tackles complex multimodal single-cell data integration challenges. Nucleic Acids Res 2024; 52:e52. [PMID: 38842910 PMCID: PMC11229356 DOI: 10.1093/nar/gkae409] [Received: 08/21/2023] [Revised: 03/08/2024] [Accepted: 05/29/2024] [Indexed: 06/07/2024] Open
Abstract
Multi-omics characterization of single cells holds outstanding potential for profiling the dynamics and relations of gene regulatory states of thousands of cells. How to integrate multimodal data is an open problem, especially when aiming to combine data from multiple sources or conditions containing both biological and technical variation. We introduce liam, a flexible model for the simultaneous horizontal and vertical integration of paired single-cell multimodal data and mosaic integration of paired with unimodal data. Liam learns a joint low-dimensional representation of the measured modalities, which proves beneficial when the information content or quality of the modalities differ. Its integration accounts for complex batch effects using a tunable combination of conditional and adversarial training, which can be optimized using replicate information while retaining selected biological variation. We demonstrate liam's superior performance on multiple paired multimodal data types, including Multiome and CITE-seq data, and in mosaic integration scenarios. Our detailed benchmarking experiments illustrate the complexities and challenges remaining for integration and the meaningful assessment of its success.
Affiliation(s)
- Pia Rautenstrauch
- Humboldt-Universität zu Berlin, Department of Computer Science, 10099 Berlin, Germany
- Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
| | - Uwe Ohler
- Humboldt-Universität zu Berlin, Department of Computer Science, 10099 Berlin, Germany
- Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
- Humboldt-Universität zu Berlin, Department of Biology, 10099 Berlin, Germany
| |
|
12
|
Wang L, Nie R, Miao X, Cai Y, Wang A, Zhang H, Zhang J, Cai J. InClust+: the deep generative framework with mask modules for multimodal data integration, imputation, and cross-modal generation. BMC Bioinformatics 2024; 25:41. [PMID: 38267858 PMCID: PMC10809631 DOI: 10.1186/s12859-024-05656-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Accepted: 01/15/2024] [Indexed: 01/26/2024] Open
Abstract
BACKGROUND: With the development of single-cell technology, many cell traits can be measured. Furthermore, multi-omics profiling technology can jointly measure two or more traits in a single cell. To process the rapidly accumulating data, computational methods for multimodal data integration are needed. RESULTS: Here, we present inClust+, a deep generative framework for multi-omics data. It is built on the earlier inClust, which is specific to transcriptome data, and is augmented with two mask modules designed for multimodal data processing: an input-mask module in front of the encoder and an output-mask module behind the decoder. InClust+ was first used to integrate scRNA-seq and MERFISH data from similar cell populations, and to impute MERFISH data based on scRNA-seq data. Then, inClust+ was shown to be capable of integrating multimodal data (e.g. tri-modal data with gene expression, chromatin accessibility and protein abundance) with batch effects. Finally, inClust+ was used to integrate an unlabeled monomodal scRNA-seq dataset and two labeled multimodal CITE-seq datasets, transfer labels from the CITE-seq datasets to the scRNA-seq dataset, and generate the missing modality of protein abundance in the monomodal scRNA-seq data. In these examples, the performance of inClust+ is better than or comparable to the most recent tools in the corresponding task. CONCLUSIONS: InClust+ is a suitable framework for handling multimodal data. Moreover, the successful implementation of the mask modules in inClust+ means that they can be applied to other deep learning methods with a similar encoder-decoder architecture to broaden the application scope of these models.
Affiliation(s)
- Lifei Wang
- Shulan (Hangzhou) Hospital, Affiliated to Zhejiang Shuren University Shulan International Medical College, Hangzhou, China
| | - Rui Nie
- China National Center for Bioinformation, Beijing, China
- Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Xuexia Miao
- China National Center for Bioinformation, Beijing, China
- Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
| | - Yankai Cai
- School of Economics and Management, China University of Geosciences, Wuhan, China
| | - Anqi Wang
- Shulan (Hangzhou) Hospital, Affiliated to Zhejiang Shuren University Shulan International Medical College, Hangzhou, China
| | - Hanwen Zhang
- Shulan (Hangzhou) Hospital, Affiliated to Zhejiang Shuren University Shulan International Medical College, Hangzhou, China
| | - Jiang Zhang
- School of Systems Science, Beijing Normal University, Beijing, 100875, China
| | - Jun Cai
- China National Center for Bioinformation, Beijing, China
- Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| |
|
13
|
Lee MYY, Kaestner KH, Li M. Benchmarking algorithms for joint integration of unpaired and paired single-cell RNA-seq and ATAC-seq data. Genome Biol 2023; 24:244. [PMID: 37875977 PMCID: PMC10594700 DOI: 10.1186/s13059-023-03073-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/24/2023] [Accepted: 09/25/2023] [Indexed: 10/26/2023] Open
Abstract
BACKGROUND: Single-cell RNA-sequencing (scRNA-seq) measures gene expression in single cells, while single-nucleus ATAC-sequencing (snATAC-seq) quantifies chromatin accessibility in single nuclei. These two data types provide complementary information for deciphering cell types and states. However, when analyzed individually, they sometimes produce conflicting results regarding cell type/state assignment. The power is compromised since the two modalities reflect the same underlying biology. Recently, it has become possible to measure both gene expression and chromatin accessibility from the same nucleus. Such paired data enable the direct modeling of the relationships between the two modalities. Given the availability of the vast amount of single-modality data, it is desirable to integrate the paired and unpaired single-modality datasets to gain a comprehensive view of the cellular complexity. RESULTS: We benchmark nine existing single-cell multi-omic data integration methods. Specifically, we evaluate to what extent the multiome data provide additional guidance for analyzing the existing single-modality data, and whether these methods uncover peak-gene associations from single-modality data. Our results indicate that multiome data are helpful for annotating single-modality data. However, we emphasize that the availability of an adequate number of nuclei in the multiome dataset is crucial for achieving accurate cell type annotation. Insufficient representation of nuclei may compromise the reliability of the annotations. Additionally, when generating a multiome dataset, the number of cells is more important than sequencing depth for cell type annotation. CONCLUSIONS: Seurat v4 is the best currently available platform for integrating scRNA-seq, snATAC-seq, and multiome data even in the presence of complex batch effects.
Affiliation(s)
- Michelle Y Y Lee
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Graduate Group in Genomics and Computational Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, 19104, USA
| | - Klaus H Kaestner
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, 19104, USA
| | - Mingyao Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
|