1
|
Kong T, Yu T, Zhao J, Hu Z, Xiong N, Wan J, Dong X, Pan Y, Zheng H, Zhang L. scGAA: a general gated axial-attention model for accurate cell-type annotation of single-cell RNA-seq data. Sci Rep 2024; 14:22308. [PMID: 39333739 PMCID: PMC11436728 DOI: 10.1038/s41598-024-73356-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2024] [Accepted: 09/17/2024] [Indexed: 09/30/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a key technology for investigating cell development and analysing cell diversity across various diseases. However, the high dimensionality and extreme sparsity of scRNA-seq data pose great challenges for accurate cell type annotation. To address this, we developed a new cell-type annotation model called scGAA (general gated axial-attention model for accurate cell-type annotation of scRNA-seq). Based on the transformer framework, the model decomposes the traditional self-attention mechanism into horizontal and vertical attention, considerably improving computational efficiency. This axial attention mechanism can process high-dimensional data more efficiently while maintaining reasonable model complexity. Additionally, the gated unit was integrated into the model to enhance the capture of relationships between genes, which is crucial for achieving an accurate cell type annotation. The results revealed that our improved transformer model is a promising tool for practical applications. This theoretical innovation increased the model performance and provided new insights into analytical tools for scRNA-seq data.
Collapse
Affiliation(s)
- Tianci Kong
- College of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China
| | - Tiancheng Yu
- School of Sciences, Zhejiang University of Science and Technology, Hangzhou, 310023, China
| | - Jiaxin Zhao
- Department of Hepatobiliary and Pancreatic Surgery, Department of Surgery, Fourth Affiliated Hospital, School of Medicine, Zhejiang University, Yiwu, 322000, China
| | - Zhenhua Hu
- Department of Hepatobiliary and Pancreatic Surgery, Department of Surgery, Fourth Affiliated Hospital, School of Medicine, Zhejiang University, Yiwu, 322000, China
| | - Neal Xiong
- Department of Computer Science and Mathematics, Sul Ross State University, Alpine, USA
| | - Jian Wan
- College of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China
| | - Xiaoliang Dong
- College of Information Science and Engineering, Shandong Agricultural University, Taian, 271018, China
| | - Yi Pan
- Faculty of Computer Science and Control Engineering Shenzhen University of Advanced Technology, Shenzhen, 518118, China
| | - Huilin Zheng
- College of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China.
| | - Lei Zhang
- College of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China.
- College of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China.
| |
Collapse
|
2
|
Qi Y, Wang X, Qin LX. Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach. ARXIV 2024:arXiv:2409.06180v1. [PMID: 39314504 PMCID: PMC11419172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 09/25/2024]
Abstract
Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
Collapse
Affiliation(s)
- Yunhui Qi
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Department of Statistics, Iowa State University, Ames, IA, United States
| | - Xinyi Wang
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Department of Statistics, The University of California, Davis, CA, United States
| | - Li-Xuan Qin
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
| |
Collapse
|
3
|
Friedrich VD, Pennitz P, Wyler E, Adler JM, Postmus D, Müller K, Teixeira Alves LG, Prigann J, Pott F, Vladimirova D, Hoefler T, Goekeri C, Landthaler M, Goffinet C, Saliba AE, Scholz M, Witzenrath M, Trimpert J, Kirsten H, Nouailles G. Neural network-assisted humanisation of COVID-19 hamster transcriptomic data reveals matching severity states in human disease. EBioMedicine 2024:105312. [PMID: 39317638 DOI: 10.1016/j.ebiom.2024.105312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2024] [Revised: 08/12/2024] [Accepted: 08/13/2024] [Indexed: 09/26/2024] Open
Abstract
BACKGROUND Translating findings from animal models to human disease is essential for dissecting disease mechanisms, developing and testing precise therapeutic strategies. The coronavirus disease 2019 (COVID-19) pandemic has highlighted this need, particularly for models showing disease severity-dependent immune responses. METHODS Single-cell transcriptomics (scRNAseq) is well poised to reveal similarities and differences between species at the molecular and cellular level with unprecedented resolution. However, computational methods enabling detailed matching are still scarce. Here, we provide a structured scRNAseq-based approach that we applied to scRNAseq from blood leukocytes originating from humans and hamsters affected with moderate or severe COVID-19. FINDINGS Integration of data from patients with COVID-19 with two hamster models that develop moderate (Syrian hamster, Mesocricetus auratus) or severe (Roborovski hamster, Phodopus roborovskii) disease revealed that most cellular states are shared across species. A neural network-based analysis using variational autoencoders quantified the overall transcriptomic similarity across species and severity levels, showing highest similarity between neutrophils of Roborovski hamsters and patients with severe COVID-19, while Syrian hamsters better matched patients with moderate disease, particularly in classical monocytes. We further used transcriptome-wide differential expression analysis to identify which disease stages and cell types display strongest transcriptional changes. INTERPRETATION Consistently, hamsters' response to COVID-19 was most similar to humans in monocytes and neutrophils. Disease-linked pathways found in all species specifically related to interferon response or inhibition of viral replication. Analysis of candidate genes and signatures supported the results. Our structured neural network-supported workflow could be applied to other diseases, allowing better identification of suitable animal models with similar pathomechanisms across species. FUNDING This work was supported by German Federal Ministry of Education and Research, (BMBF) grant IDs: 01ZX1304B, 01ZX1604B, 01ZX1906A, 01ZX1906B, 01KI2124, 01IS18026B and German Research Foundation (DFG) grant IDs: 14933180, 431232613.
Collapse
Affiliation(s)
- Vincent D Friedrich
- University of Leipzig, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Leipzig, Germany
| | - Peter Pennitz
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Infectious Diseases, Respiratory Medicine and Critical Care, Berlin, Germany
| | - Emanuel Wyler
- Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
| | - Julia M Adler
- Freie Universität Berlin, Institut für Virologie, Berlin, Germany
| | - Dylan Postmus
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Virology, Berlin, Germany; Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany; Liverpool School of Tropical Medicine, Department of Tropical Disease Biology, Liverpool, United Kingdom
| | - Kristina Müller
- University of Leipzig, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany
| | - Luiz Gustavo Teixeira Alves
- Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
| | - Julia Prigann
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Virology, Berlin, Germany; Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany; Gladstone Institutes, San Francisco, USA
| | - Fabian Pott
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Virology, Berlin, Germany; Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | | | - Thomas Hoefler
- Freie Universität Berlin, Institut für Virologie, Berlin, Germany
| | - Cengiz Goekeri
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Infectious Diseases, Respiratory Medicine and Critical Care, Berlin, Germany; Cyprus International University, Faculty of Medicine, Nicosia, Cyprus
| | - Markus Landthaler
- Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany; Humboldt-Universität zu Berlin, Institut fuer Biologie, Berlin, Germany
| | - Christine Goffinet
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Virology, Berlin, Germany; Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany; Liverpool School of Tropical Medicine, Department of Tropical Disease Biology, Liverpool, United Kingdom
| | - Antoine-Emmanuel Saliba
- Faculty of Medicine, Institute of Molecular Infection Biology (IMIB), University of Würzburg, Würzburg, Germany; Helmholtz Institute for RNA-based Infection Research (HIRI), Helmholtz Center for Infection Research (HZI), Würzburg, Germany
| | - Markus Scholz
- University of Leipzig, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany; University of Leipzig, Faculty of Mathematics and Computer Science, Leipzig, Germany
| | - Martin Witzenrath
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Infectious Diseases, Respiratory Medicine and Critical Care, Berlin, Germany; German Center for Lung Research (DZL), Berlin, Germany
| | - Jakob Trimpert
- Freie Universität Berlin, Institut für Virologie, Berlin, Germany
| | - Holger Kirsten
- University of Leipzig, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany.
| | - Geraldine Nouailles
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Infectious Diseases, Respiratory Medicine and Critical Care, Berlin, Germany.
| |
Collapse
|
4
|
Jin W, Xia Y, Thela SR, Liu Y, Chen L. In silico generation and augmentation of regulatory variants from massively parallel reporter assay using conditional variational autoencoder. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.25.600715. [PMID: 38979263 PMCID: PMC11230389 DOI: 10.1101/2024.06.25.600715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Abstract
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. Massively parallel reporter assays (MPRAs), which are an in vitro high-throughput method, can simultaneously test thousands of variants by evaluating the existence of allele specific regulatory activity. Nevertheless, the identified labelled variants by MPRAs, which shows differential allelic regulatory effects on the gene expression are usually limited to the scale of hundreds, limiting their potential to be used as the training set for achieving a robust genome-wide prediction. To address the limitation, we propose a deep generative model, MpraVAE, to in silico generate and augment the training sample size of labelled variants. By benchmarking on several MPRA datasets, we demonstrate that MpraVAE significantly improves the prediction performance for MPRA regulatory variants compared to the baseline method, conventional data augmentation approaches as well as existing variant scoring methods. Taking autoimmune diseases as one example, we apply MpraVAE to perform a genome-wide prediction of regulatory variants and find that predicted regulatory variants are more enriched than background variants in enhancers, active histone marks, open chromatin regions in immune-related cell types, and chromatin states associated with promoter, enhancer activity and binding sites of cMyC and Pol II that regulate gene expression. Importantly, predicted regulatory variants are found to link immune-related genes by leveraging chromatin loop and accessible chromatin, demonstrating the importance of MpraVAE in genetic and gene discovery for complex traits.
Collapse
Affiliation(s)
- Weijia Jin
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| | - Yi Xia
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| | - Sai Ritesh Thela
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| | - Yunlong Liu
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Li Chen
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| |
Collapse
|
5
|
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assuranceof synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics are highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Collapse
Affiliation(s)
- Vibeke Binz Vallevik
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway.
| | | | | | - Severin Elvatun
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
| | - Helga M B Brøgger
- DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
| | | | - Bjørn Edwin
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
| | | | | | - Jan F Nygård
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
| |
Collapse
|
6
|
Selvarajoo K, Maurer-Stroh S. Towards multi-omics synthetic data integration. Brief Bioinform 2024; 25:bbae213. [PMID: 38711370 PMCID: PMC11074586 DOI: 10.1093/bib/bbae213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2024] [Accepted: 04/19/2024] [Indexed: 05/08/2024] Open
Abstract
Across many scientific disciplines, the development of computational models and algorithms for generating artificial or synthetic data is gaining momentum. In biology, there is a great opportunity to explore this further as more and more big data at multi-omics level are generated recently. In this opinion, we discuss the latest trends in biological applications based on process-driven and data-driven aspects. Moving ahead, we believe these methodologies can help shape novel multi-omics-scale cellular inferences.
Collapse
Affiliation(s)
- Kumar Selvarajoo
- Biomolecular Sequence to Function Division, BII, (A*STAR), Singapore, 138671, Republic of Singapore
- Synthetic Biology Translational Research Program, Yong Loo Lin School of Medicine, NUS, Singapore, 117456, Republic of Singapore
- School of Biological Sciences, Nanyang Technological University (NTU), Singapore 639798, Republic of Singapore
| | - Sebastian Maurer-Stroh
- Biomolecular Sequence to Function Division, BII, (A*STAR), Singapore, 138671, Republic of Singapore
- Synthetic Biology Translational Research Program, Yong Loo Lin School of Medicine, NUS, Singapore, 117456, Republic of Singapore
| |
Collapse
|
7
|
Salimi A, Jang JH, Lee JY. Leveraging attention-enhanced variational autoencoders: Novel approach for investigating latent space of aptamer sequences. Int J Biol Macromol 2024; 255:127884. [PMID: 37926303 DOI: 10.1016/j.ijbiomac.2023.127884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 10/27/2023] [Accepted: 11/02/2023] [Indexed: 11/07/2023]
Abstract
Aptamers are increasingly recognized as potent alternatives to antibodies for diagnostic and therapeutic applications. The application of deep learning, particularly attention-based models, for aptamer (DNA/RNA) sequences is an innovative field. The ongoing advancements in aptamer sequencing technologies coupled with machine learning algorithms have resulted in novel developments. Further research is required to investigate the full potential of deep learning models and address the challenges associated with the generation of sequences, like the large search space of possible sequences. In this study, we propose a workflow that integrates an attention mechanism within a framework of a generative variational autoencoder, to generate novel sequences by expanding latent memory. They show 100 % novelty compared with the dataset, and approximately 88 % of them show negative values for the minimum free energy, which may indicate the likelihood of an RNA sequence folding into a functional structure. Because the field of aptamer discovery is affected by data scarcity, advanced strategies that facilitate the generation of diverse and superior sequences are necessitated. The utilization of our workflow can result in novel aptamers. Thus, investigations such as the present study can address the abovementioned challenge. Our research is anticipated to facilitate further discoveries and advancements in aptamer fields.
Collapse
Affiliation(s)
- Abbas Salimi
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea
| | - Jee Hwan Jang
- School of Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea; Ucaretron Inc., No. 3508, 40, Simin-daero 365 beon-gil, Dongan-gu, Anyang-si, Gyeonggi-do, Republic of Korea.
| | - Jin Yong Lee
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea.
| |
Collapse
|
8
|
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023] Open
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity, and rare populations. The single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, the single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, thus making analysis with traditional computational approaches difficult and unfeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing research in single-cell.
Collapse
Affiliation(s)
- Nafiseh Erfanian
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
| | - A Ali Heydari
- Department of Applied Mathematics, University of California, Merced, CA, USA; Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Adib Miraki Feriz
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
| | - Pablo Iañez
- Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
| | - Afshin Derakhshani
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
| | | | - Mohsen Farahpour
- Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
| | - Seyyed Mohammad Razavi
- Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
| | - Saeed Nasseri
- Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
| | - Hossein Safarpour
- Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran.
| | - Amirhossein Sahebkar
- Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran.
| |
Collapse
|
9
|
Davalos OA, Heydari AA, Fertig EJ, Sindi SS, Hoyer KK. Boosting Single-Cell RNA Sequencing Analysis with Simple Neural Attention. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.29.542760. [PMID: 37398136 PMCID: PMC10312486 DOI: 10.1101/2023.05.29.542760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/04/2023]
Abstract
A limitation of current deep learning (DL) approaches for single-cell RNA sequencing (scRNAseq) analysis is the lack of interpretability. Moreover, existing pipelines are designed and trained for specific tasks used disjointly for different stages of analysis. We present scANNA, a novel interpretable DL model for scRNAseq studies that leverages neural attention to learn gene associations. After training, the learned gene importance (interpretability) is used to perform downstream analyses (e.g., global marker selection and cell-type classification) without retraining. ScANNA's performance is comparable to or better than state-of-the-art methods designed and trained for specific standard scRNAseq analyses even though scANNA was not trained for these tasks explicitly. ScANNA enables researchers to discover meaningful results without extensive prior knowledge or training separate task-specific models, saving time and enhancing scRNAseq analyses.
Collapse
Affiliation(s)
- Oscar A. Davalos
- Quantitative and Systems Biology Graduate Program, University of California, Merced, CA, USA
| | - A. Ali Heydari
- Department of Applied Mathematics, University of California, Merced, CA, USA
- Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Elana J. Fertig
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Suzanne S. Sindi
- Department of Applied Mathematics, University of California, Merced, CA, USA
- Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Katrina K. Hoyer
- Health Sciences Research Institute, University of California, Merced, CA, USA
- Department of Molecular and Cell Biology, School of Natural Sciences, University of California, Merced, CA, USA
| |
Collapse
|
10
|
Liu C, Huang H, Yang P. Multi-task learning from multimodal single-cell omics with Matilda. Nucleic Acids Res 2023; 51:e45. [PMID: 36912104 PMCID: PMC10164589 DOI: 10.1093/nar/gkad157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Revised: 01/28/2023] [Accepted: 02/21/2023] [Indexed: 03/14/2023] Open
Abstract
Multimodal single-cell omics technologies enable multiple molecular programs to be simultaneously profiled at a global scale in individual cells, creating opportunities to study biological systems at a resolution that was previously inaccessible. However, the analysis of multimodal single-cell omics data is challenging due to the lack of methods that can integrate across multiple data modalities generated from such technologies. Here, we present Matilda, a multi-task learning method for integrative analysis of multimodal single-cell omics data. By leveraging the interrelationship among tasks, Matilda learns to perform data simulation, dimension reduction, cell type classification, and feature selection in a single unified framework. We compare Matilda with other state-of-the-art methods on datasets generated from some of the most popular multimodal single-cell omics technologies. Our results demonstrate the utility of Matilda for addressing multiple key tasks on integrative multimodal single-cell omics data analysis. Matilda is implemented in Pytorch and is freely available from https://github.com/PYangLab/Matilda.
Collapse
Affiliation(s)
- Chunlei Liu
- Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
| | - Hao Huang
- Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| | - Pengyi Yang
- Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
| |
Collapse
|
11
|
Jiao L, Wang G, Dai H, Li X, Wang S, Song T. scTransSort: Transformers for Intelligent Annotation of Cell Types by Gene Embeddings. Biomolecules 2023; 13:biom13040611. [PMID: 37189359 DOI: 10.3390/biom13040611] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2022] [Revised: 03/05/2023] [Accepted: 03/10/2023] [Indexed: 03/31/2023] Open
Abstract
Single-cell transcriptomics is rapidly advancing our understanding of the composition of complex tissues and biological cells, and single-cell RNA sequencing (scRNA-seq) holds great potential for identifying and characterizing the cell composition of complex tissues. Cell type identification by analyzing scRNA-seq data is mostly limited by time-consuming and irreproducible manual annotation. As scRNA-seq technology scales to thousands of cells per experiment, the exponential increase in the number of cell samples makes manual annotation more difficult. On the other hand, the sparsity of gene transcriptome data remains a major challenge. This paper applied the idea of the transformer to single-cell classification tasks based on scRNA-seq data. We propose scTransSort, a cell-type annotation method pretrained with single-cell transcriptomics data. The scTransSort incorporates a method of representing genes as gene expression embedding blocks to reduce the sparsity of data used for cell type identification and reduce the computational complexity. The feature of scTransSort is that its implementation of intelligent information extraction for unordered data, automatically extracting valid features of cell types without the need for manually labeled features and additional references. In experiments on cells from 35 human and 26 mouse tissues, scTransSort successfully elucidated its high accuracy and high performance for cell type identification, and demonstrated its own high robustness and generalization ability.
Collapse
|
12
|
Heydari AA, Sindi SS. Deep learning in spatial transcriptomics: Learning from the next next-generation sequencing. BIOPHYSICS REVIEWS 2023; 4:011306. [PMID: 38505815 PMCID: PMC10903438 DOI: 10.1063/5.0091135] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/11/2022] [Accepted: 12/19/2022] [Indexed: 03/21/2024]
Abstract
Spatial transcriptomics (ST) technologies are rapidly becoming the extension of single-cell RNA sequencing (scRNAseq), holding the potential of profiling gene expression at a single-cell resolution while maintaining cellular compositions within a tissue. Having both expression profiles and tissue organization enables researchers to better understand cellular interactions and heterogeneity, providing insight into complex biological processes that would not be possible with traditional sequencing technologies. Data generated by ST technologies are inherently noisy, high-dimensional, sparse, and multi-modal (including histological images, count matrices, etc.), thus requiring specialized computational tools for accurate and robust analysis. However, many ST studies currently utilize traditional scRNAseq tools, which are inadequate for analyzing complex ST datasets. On the other hand, many of the existing ST-specific methods are built upon traditional statistical or machine learning frameworks, which have shown to be sub-optimal in many applications due to the scale, multi-modality, and limitations of spatially resolved data (such as spatial resolution, sensitivity, and gene coverage). Given these intricacies, researchers have developed deep learning (DL)-based models to alleviate ST-specific challenges. These methods include new state-of-the-art models in alignment, spatial reconstruction, and spatial clustering, among others. However, DL models for ST analysis are nascent and remain largely underexplored. In this review, we provide an overview of existing state-of-the-art tools for analyzing spatially resolved transcriptomics while delving deeper into the DL-based approaches. We discuss the new frontiers and the open questions in this field and highlight domains in which we anticipate transformational DL applications.
Collapse
|
13
|
Liu Q, Liang Y, Wang D, Li J. LFSC: A linear fast semi-supervised clustering algorithm that integrates reference-bulk and single-cell transcriptomes. Front Genet 2022; 13:1068075. [PMID: 36531230 PMCID: PMC9754124 DOI: 10.3389/fgene.2022.1068075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 11/09/2022] [Indexed: 08/29/2023] Open
Abstract
The identification of cell types in complex tissues is an important step in research into cellular heterogeneity in disease. We present a linear fast semi-supervised clustering (LFSC) algorithm that utilizes reference samples generated from bulk RNA sequencing data to identify cell types from single-cell transcriptomes. An anchor graph is constructed to depict the relationship between reference samples and cells. By applying a connectivity constraint to the learned graph, LFSC enables the preservation of the underlying cluster structure. Moreover, the overall complexity of LFSC is linear to the size of the data, which greatly improves effectiveness and efficiency. By applying LFSC to real single-cell RNA sequencing datasets, we discovered that it has superior performance over existing baseline methods in clustering accuracy and robustness. An application using infiltrating T cells in liver cancer demonstrates that LFSC can successfully find new cell types, discover differently expressed genes, and explore new cancer-associated biomarkers.
Collapse
Affiliation(s)
- Qiaoming Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yingjian Liang
- Department of General Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Key Laboratory of Hepatosplenic Surgery, Ministry of Education, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Dong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Jie Li
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
14
|
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Collapse
Affiliation(s)
- Matthew Brendel
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA.
| | - Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Hao Zhang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Olivier Elemento
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA.
| |
Collapse
|