1
Tang Z, Chen G, Chen S, Yao J, You L, Chen CYC. Modal-nexus auto-encoder for multi-modality cellular data integration and imputation. Nat Commun 2024; 15:9021. [PMID: 39424861 PMCID: PMC11489673 DOI: 10.1038/s41467-024-53355-6] [Received: 04/23/2024] [Accepted: 10/02/2024] [Indexed: 10/21/2024]
Abstract
Heterogeneous feature spaces and technical noise hinder the cellular data integration and imputation. The high cost of obtaining matched data across modalities further restricts analysis. Thus, there's a critical need for deep learning approaches to effectively integrate and impute unpaired multi-modality single-cell data, enabling deeper insights into cellular behaviors. To address these issues, we introduce the Modal-Nexus Auto-Encoder (Monae). Leveraging regulatory relationships between modalities and employing contrastive learning within modality-specific auto-encoders, Monae enhances cell representations in the unified space. The integration capability of Monae furnishes it with modality-complementary cellular representations, enabling the generation of precise intra-modal and cross-modal imputation counts for extensive and complex downstream tasks. In addition, we develop Monae-E (Monae-Extension), a variant of Monae that can converge rapidly and support biological discoveries. Evaluations on various datasets have validated Monae and Monae-E's accuracy and robustness in multi-modality cellular data integration and imputation.
Affiliation(s)
- Zhenchao Tang
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
  - AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Guanxing Chen
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
  - AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Shouzhi Chen
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
  - AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Linlin You
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
- Calvin Yu-Chian Chen
  - AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
  - State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Genomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
  - Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, 41354, Taiwan
  - Guangdong L-Med Biotechnology Co., Ltd., Meizhou, 514699, China
2
Li L, Dannenfelser R, Cruz C, Yao V. A best-match approach for gene set analyses in embedding spaces. Genome Res 2024; 34:1421-1433. [PMID: 39231608 PMCID: PMC11529866 DOI: 10.1101/gr.279141.124] [Received: 02/15/2024] [Accepted: 08/29/2024] [Indexed: 09/06/2024]
Abstract
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
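The best-match idea in this abstract can be illustrated with a minimal sketch. It assumes cosine similarity between gene embeddings and a symmetric average of per-gene best matches; the function names `cosine_matrix` and `best_match_score` and the toy vectors are hypothetical, and any normalization or significance testing that the published ANDES method applies is not reproduced here.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def best_match_score(emb_a, emb_b):
    """Symmetric best-match similarity between two gene sets.

    Each gene is scored by its most similar gene in the other set,
    so a set containing several distinct sub-groups is not penalized
    for genes that have no counterpart in one particular sub-group.
    """
    sim = cosine_matrix(emb_a, emb_b)
    a_to_b = sim.max(axis=1).mean()  # best match in B for every gene in A
    b_to_a = sim.max(axis=0).mean()  # best match in A for every gene in B
    return 0.5 * (a_to_b + b_to_a)

# Toy 2-D "embeddings" (illustrative values, not real gene embeddings).
set_a = np.array([[1.0, 0.0], [0.0, 1.0]])
set_b = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

score_ab = best_match_score(set_a, set_b)
score_aa = best_match_score(set_a, set_a)
```

Averaging best matches, rather than all pairwise similarities, is what lets two diverse sets score highly when each member has at least one close counterpart in the other set.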
Affiliation(s)
- Lechuan Li
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Ruth Dannenfelser
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Charlie Cruz
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Vicky Yao
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
3
Zhang X, Qian K, Li H. Structure-preserved integration of scRNA-seq data using heterogeneous graph neural network. Brief Bioinform 2024; 25:bbae538. [PMID: 39446194 PMCID: PMC11500609 DOI: 10.1093/bib/bbae538] [Received: 08/05/2024] [Revised: 09/23/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024]
Abstract
The integration of single-cell RNA sequencing (scRNA-seq) data from multiple experimental batches enables more comprehensive characterizations of cell states. Given that existing methods disregard the structural information between cells and genes, we proposed a structure-preserved scRNA-seq data integration approach using heterogeneous graph neural network (scHetG). By establishing a heterogeneous graph that represents the interactions between multiple batches of cells and genes, and combining a heterogeneous graph neural network with contrastive learning, scHetG concurrently obtained cell and gene embeddings with structural information. A comprehensive assessment covering different species, tissues and scales indicated that scHetG is an efficacious method for eliminating batch effects while preserving the structural information of cells and genes, including batch-specific cell types and cell-type specific gene co-expression patterns.
Affiliation(s)
- Xun Zhang
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Kun Qian
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Hongwei Li
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
4
Wang Z, Luo P, Xiao M, Wang B, Liu T, Sun X. Recover then aggregate: unified cross-modal deep clustering with global structural information for single-cell data. Brief Bioinform 2024; 25:bbae485. [PMID: 39356327 PMCID: PMC11445907 DOI: 10.1093/bib/bbae485] [Received: 07/08/2024] [Revised: 08/24/2024] [Accepted: 09/13/2024] [Indexed: 10/03/2024]
Abstract
Single-cell cross-modal joint clustering has been extensively utilized to investigate the tumor microenvironment. Although numerous approaches have been suggested, accurate clustering remains the main challenge. First, the gene expression matrix frequently contains numerous missing values due to measurement limitations. The majority of existing clustering methods treat it as a typical multi-modal dataset without further processing. Few methods conduct recovery before clustering and do not sufficiently engage with the underlying research, leading to suboptimal outcomes. Additionally, the existing cross-modal information fusion strategy does not ensure consistency of representations across different modes, potentially leading to the integration of conflicting information, which could degrade performance. To address these challenges, we propose the 'Recover then Aggregate' strategy and introduce the Unified Cross-Modal Deep Clustering model. Specifically, we have developed a data augmentation technique based on neighborhood similarity, iteratively imposing rank constraints on the Laplacian matrix, thus updating the similarity matrix and recovering dropout events. Concurrently, we integrate cross-modal features and employ contrastive learning to align modality-specific representations with consistent ones, enhancing the effective integration of diverse modal information. Comprehensive experiments on five real-world multi-modal datasets have demonstrated this method's superior effectiveness in single-cell clustering tasks.
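The alignment step mentioned in this abstract (contrastive learning that pulls each cell's modality-specific representation toward its consistent representation) can be sketched with a generic InfoNCE-style loss. This is an assumption for illustration only: the abstract does not state which contrastive objective the authors use, and the function name `info_nce`, the temperature value, and the random embeddings are hypothetical.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a should match row i of z_b.

    Other rows of z_b act as negatives for row i of z_a.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
modality_specific = rng.normal(size=(32, 16))  # toy per-cell embeddings
consistent = modality_specific + 0.05 * rng.normal(size=(32, 16))

loss_aligned = info_nce(modality_specific, consistent)
# Rolling by one row pairs every cell with a different cell's embedding.
loss_mismatched = info_nce(modality_specific, np.roll(consistent, 1, axis=0))
```

The loss is small when each cell's two representations agree and large when they are paired with the wrong cell, which is the property an alignment objective of this kind relies on.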
Affiliation(s)
- Ziyi Wang
  - Department of Surgical Oncology and General Surgery, First Hospital of China Medical University, Shenyang 110001, PR China
  - Section of Esophageal and Mediastinal Oncology, Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
  - Department of Thoracic Surgery, The First Hospital of China Medical University, No.155 North Nanjing Street, Shenyang 110001, People’s Republic of China
- Peng Luo
  - Department of Thoracic Surgery, Xinqiao Hospital, Army Medical University, Chongqing 400038, China
- Mingming Xiao
  - Department of Pathology, People’s Hospital of China Medical University (Liaoning Provincial People’s Hospital), Shenyang, Liaoning Province 110015, People’s Republic of China
- Boyang Wang
  - Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607, United States
- Tianyu Liu
  - Computer Science and Engineering, University of California, Riverside, Riverside, CA 92521, United States
- Xiangyu Sun
  - Cancer Hospital of China Medical University, Liaoning Cancer Hospital and Institute, Shenyang 110042, Liaoning, China
  - Cancer Hospital of Dalian University of Technology, Shenyang, Liaoning Province 110042, China
5
Loers JU, Vermeirssen V. A single-cell multimodal view on gene regulatory network inference from transcriptomics and chromatin accessibility data. Brief Bioinform 2024; 25:bbae382. [PMID: 39207727 PMCID: PMC11359808 DOI: 10.1093/bib/bbae382] [Received: 04/05/2024] [Revised: 06/27/2024] [Accepted: 07/23/2024] [Indexed: 09/04/2024]
Abstract
Eukaryotic gene regulation is a combinatorial, dynamic, and quantitative process that plays a vital role in development and disease and can be modeled at a systems level in gene regulatory networks (GRNs). The wealth of multi-omics data measured on the same samples and even on the same cells has lifted the field of GRN inference to the next stage. Combinations of (single-cell) transcriptomics and chromatin accessibility allow the prediction of fine-grained regulatory programs that go beyond mere correlation of transcription factor and target gene expression, with enhancer GRNs (eGRNs) modeling molecular interactions between transcription factors, regulatory elements, and target genes. In this review, we highlight the key components for successful (e)GRN inference from (sc)RNA-seq and (sc)ATAC-seq data exemplified by state-of-the-art methods as well as open challenges and future developments. Moreover, we address preprocessing strategies, metacell generation and computational omics pairing, transcription factor binding site detection, and linear and three-dimensional approaches to identify chromatin interactions as well as dynamic and causal eGRN inference. We believe that the integration of transcriptomics together with epigenomics data at a single-cell level is the new standard for mechanistic network inference, and that it can be further advanced with integrating additional omics layers and spatiotemporal data, as well as with shifting the focus towards more quantitative and causal modeling strategies.
Affiliation(s)
- Jens Uwe Loers
  - Lab for Computational Biology, Integromics and Gene Regulation (CBIGR), Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
  - Department of Biomedical Molecular Biology, Ghent University, Zwijnaarde-Technologiepark 71, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Vanessa Vermeirssen
  - Lab for Computational Biology, Integromics and Gene Regulation (CBIGR), Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
  - Department of Biomedical Molecular Biology, Ghent University, Zwijnaarde-Technologiepark 71, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium
6
What's in a method name? Nat Methods 2024; 21:923. [PMID: 38866986 DOI: 10.1038/s41592-024-02323-5] [Indexed: 06/14/2024]
7
Tayyebi Z, Pine AR, Leslie CS. Scalable and unbiased sequence-informed embedding of single-cell ATAC-seq data with CellSpace. Nat Methods 2024; 21:1014-1022. [PMID: 38724693 PMCID: PMC11166566 DOI: 10.1038/s41592-024-02274-x] [Received: 05/02/2022] [Accepted: 04/11/2024] [Indexed: 06/13/2024]
Abstract
Standard scATAC sequencing (scATAC-seq) analysis pipelines represent cells as sparse numeric vectors relative to an atlas of peaks or genomic tiles and consequently ignore genomic sequence information at accessible loci. Here we present CellSpace, an efficient and scalable sequence-informed embedding algorithm for scATAC-seq that learns a mapping of DNA k-mers and cells to the same space, to address this limitation. We show that CellSpace captures meaningful latent structure in scATAC-seq datasets, including cell subpopulations and developmental hierarchies, and can score transcription factor activities in single cells based on proximity to binding motifs embedded in the same space. Importantly, CellSpace implicitly mitigates batch effects arising from multiple samples, donors or assays, even when individual datasets are processed relative to different peak atlases. Thus, CellSpace provides a powerful tool for integrating and interpreting large-scale scATAC-seq compendia.
Affiliation(s)
- Zakieh Tayyebi
  - Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Allison R Pine
  - Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Christina S Leslie
  - Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
8
Feng X, Shu W, Li M, Li J, Xu J, He M. Pathogenomics for accurate diagnosis, treatment, prognosis of oncology: a cutting edge overview. J Transl Med 2024; 22:131. [PMID: 38310237 PMCID: PMC10837897 DOI: 10.1186/s12967-024-04915-3] [Received: 10/31/2023] [Accepted: 01/20/2024] [Indexed: 02/05/2024]
Abstract
The capability to gather heterogeneous data, alongside the increasing power of artificial intelligence to examine it, is leading a revolution in harnessing multimodal data in the life sciences. However, most approaches are limited to unimodal data, leaving integrated approaches across modalities relatively underdeveloped in computational pathology. Pathogenomics, as an invasive method to integrate advanced molecular diagnostics from genomic data, morphological information from histopathological imaging, and codified clinical data, enables the discovery of new multimodal cancer biomarkers to propel the field of precision oncology in the coming decade. In this perspective, we offer our opinions on synthesizing complementary modalities of data with emerging multimodal artificial intelligence methods in pathogenomics. These include correlation between the pathological and genomic profiles of cancer and fusion of the histology and genomics profiles of cancer. We also present challenges, opportunities, and avenues for future work.
Affiliation(s)
- Xiaobing Feng
  - College of Electrical and Information Engineering, Hunan University, Changsha, China
  - Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
- Wen Shu
  - College of Electrical and Information Engineering, Hunan University, Changsha, China
  - Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
- Mingya Li
  - College of Electrical and Information Engineering, Hunan University, Changsha, China
  - Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
- Junyu Li
  - College of Electrical and Information Engineering, Hunan University, Changsha, China
  - Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
- Junyao Xu
  - Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
- Min He
  - College of Electrical and Information Engineering, Hunan University, Changsha, China
  - Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
9
Xiao H, Rosen A, Chhibbar P, Moise L, Das J. From bench to bedside via bytes: Multi-omic immunoprofiling and integration using machine learning and network approaches. Hum Vaccin Immunother 2023; 19:2282803. [PMID: 38100557 PMCID: PMC10730168 DOI: 10.1080/21645515.2023.2282803] [Received: 07/15/2023] [Accepted: 11/09/2023] [Indexed: 12/17/2023]
Abstract
A significant surge in research endeavors leverages the vast potential of high-throughput omic technology platforms for broad profiling of biological responses to vaccines and cutting-edge immunotherapies and stem-cell therapies under development. These profiles capture different aspects of core regulatory and functional processes at different scales of resolution from molecular and cellular to organismal. Systems approaches capture the complex and intricate interplay between these layers and scales. Here, we summarize experimental data modalities, for characterizing the genome, epigenome, transcriptome, proteome, metabolome, and antibody-ome, that enable us to generate large-scale immune profiles. We also discuss machine learning and network approaches that are commonly used to analyze and integrate these modalities, to gain insights into correlates and mechanisms of natural and vaccine-mediated immunity as well as therapy-induced immunomodulation.
Affiliation(s)
- Hanxi Xiao
  - Center for Systems Immunology, Departments of Immunology and Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Aaron Rosen
  - Center for Systems Immunology, Departments of Immunology and Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Prabal Chhibbar
  - Center for Systems Immunology, Departments of Immunology and Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Jishnu Das
  - Center for Systems Immunology, Departments of Immunology and Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
10
Song D, Li K, Ge X, Li JJ. ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping. bioRxiv 2023:2023.07.21.550107. [PMID: 37546812 PMCID: PMC10401959 DOI: 10.1101/2023.07.21.550107] [Indexed: 08/08/2023]
Abstract
In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE test for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to find cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
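The double-dipping problem described in this abstract can be reproduced on toy data. The sketch below is not ClusterDE: it assumes Gaussian values instead of counts, a median split on a per-cell score as a stand-in for a clustering algorithm, and a Welch t statistic with a normal-approximation cutoff as the DE test. All names (`welch_t`, `false_positive_rate`) and parameter values are illustrative.

```python
import numpy as np

def welch_t(x, y):
    """Welch t statistic for every column (gene) between two groups of rows (cells)."""
    mean_x, mean_y = x.mean(axis=0), y.mean(axis=0)
    var_x, var_y = x.var(axis=0, ddof=1), y.var(axis=0, ddof=1)
    return (mean_x - mean_y) / np.sqrt(var_x / len(x) + var_y / len(y))

def false_positive_rate(data, labels, z_crit=1.96):
    """Fraction of genes called DE at roughly the 5% level (normal approximation)."""
    t = welch_t(data[labels], data[~labels])
    return float(np.mean(np.abs(t) > z_crit))

rng = np.random.default_rng(0)
# One homogeneous population: 2000 cells x 100 genes, so no gene is truly DE.
data = rng.normal(size=(2000, 100))

# Double dipping: derive the "clusters" from the same data that is tested.
cell_score = data.sum(axis=1)
dipped_labels = cell_score > np.median(cell_score)

# Control: groups of the same sizes that are independent of the data.
random_labels = rng.permutation(dipped_labels)

fpr_dipped = false_positive_rate(data, dipped_labels)
fpr_random = false_positive_rate(data, random_labels)
```

Because every gene contributes to the score that defines the groups, most genes appear differentially expressed in the double-dipped comparison even though the population is homogeneous, while the independent split stays near the nominal level.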
Affiliation(s)
- Dongyuan Song
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
- Kexin Li
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554
- Xinzhou Ge
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554
- Jingyi Jessica Li
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554
  - Department of Human Genetics, University of California, Los Angeles, CA 90095-7088
  - Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766
  - Department of Biostatistics, University of California, Los Angeles, CA 90095-1772
  - Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA 02138