1
Sin DD. What Single Cell RNA Sequencing Has Taught Us about Chronic Obstructive Pulmonary Disease. Tuberc Respir Dis (Seoul) 2024; 87:252-260. [PMID: 38369875 PMCID: PMC11222093 DOI: 10.4046/trd.2024.0001] [Received: 01/03/2024] [Accepted: 02/17/2024] [Indexed: 02/20/2024] Open
Abstract
Chronic obstructive pulmonary disease (COPD) affects close to 400 million people worldwide and is the 3rd leading cause of mortality. It is a heterogeneous disorder with multiple endophenotypes, each driven by specific molecular networks and processes. Therapeutic discovery in COPD has lagged behind other disease areas owing to a lack of understanding of its pathobiology and scarcity of biomarkers to guide therapies. Single cell RNA sequencing (scRNA-seq) is a powerful new tool to identify important cellular and molecular networks that play a crucial role in disease pathogenesis. This paper provides an overview of the scRNA-seq technology and its application in COPD and the lessons learned to date from scRNA-seq experiments in COPD.
Affiliation(s)
- Don D. Sin
  - Centre for Heart Lung Innovation, St. Paul’s Hospital and Division of Respiratory Medicine, University of British Columbia, Vancouver, BC, Canada

2
Fang C, Selega A, Campbell KR. Beyond benchmarking and towards predictive models of dataset-specific single-cell RNA-seq pipeline performance. Genome Biol 2024; 25:159. [PMID: 38886757 PMCID: PMC11184819 DOI: 10.1186/s13059-024-03304-9] [Received: 01/04/2024] [Accepted: 06/06/2024] [Indexed: 06/20/2024] Open
Abstract
BACKGROUND: The advent of single-cell RNA-sequencing (scRNA-seq) has driven significant computational methods development for all steps in the scRNA-seq data analysis pipeline, including filtering, normalization, and clustering. The large number of methods and their resulting parameter combinations has created a combinatorial set of possible pipelines to analyze scRNA-seq data, which leads to the obvious question: which is best? Several benchmarking studies compare methods but frequently find variable performance depending on dataset and pipeline characteristics. Alternatively, the large number of scRNA-seq datasets along with advances in supervised machine learning raise a tantalizing possibility: could the optimal pipeline be predicted for a given dataset?
RESULTS: Here, we begin to answer this question by applying 288 scRNA-seq analysis pipelines to 86 datasets and quantifying pipeline success via a range of measures evaluating cluster purity and biological plausibility. We build supervised machine learning models to predict pipeline success given a range of dataset and pipeline characteristics. We find that prediction performance is significantly better than random and that in many cases pipelines predicted to perform well provide clustering outputs similar to expert-annotated cell type labels. We identify characteristics of datasets that correlate with strong prediction performance that could guide when such prediction models may be useful.
CONCLUSIONS: Supervised machine learning models have utility for recommending analysis pipelines and therefore the potential to alleviate the burden of choosing from the near-infinite number of possibilities. Different aspects of datasets influence the predictive performance of such models which will further guide users.
Affiliation(s)
- Cindy Fang
  - Lunenfeld-Tanenbaum Research Institute, Toronto, Canada
  - Program in Bioinformatics and Computational Biology, University of Toronto, Toronto, Canada
  - Present address: Department of Biostatistics, Johns Hopkins University, Baltimore, USA
- Alina Selega
  - Lunenfeld-Tanenbaum Research Institute, Toronto, Canada
  - Vector Institute, Toronto, Canada
- Kieran R Campbell
  - Lunenfeld-Tanenbaum Research Institute, Toronto, Canada
  - Vector Institute, Toronto, Canada
  - Departments of Molecular Genetics, Statistical Sciences, Computer Science, University of Toronto, Toronto, Canada
  - Ontario Institute for Cancer Research, Toronto, Canada

3
Sun Y, Kong L, Huang J, Deng H, Bian X, Li X, Cui F, Dou L, Cao C, Zou Q, Zhang Z. A comprehensive survey of dimensionality reduction and clustering methods for single-cell and spatial transcriptomics data. Brief Funct Genomics 2024:elae023. [PMID: 38860675 DOI: 10.1093/bfgp/elae023] [Received: 12/26/2023] [Revised: 02/29/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024] Open
Abstract
In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.
Affiliation(s)
- Yidi Sun
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lingling Kong
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Jiayi Huang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Hongyan Deng
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Xinling Bian
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Xingfeng Li
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lijun Dou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, United States
- Chen Cao
  - School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 210029, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China

4
Canzar S, Do VH, Jelić S, Laue S, Matijević D, Prusina T. Metric multidimensional scaling for large single-cell datasets using neural networks. Algorithms Mol Biol 2024; 19:21. [PMID: 38863064 PMCID: PMC11165904 DOI: 10.1186/s13015-024-00265-3] [Received: 12/10/2021] [Accepted: 05/22/2024] [Indexed: 06/13/2024] Open
Abstract
Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a simple neural network-based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.
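As a rough illustration of what "approximately preserving the pairwise distances" means, the sketch below minimizes the raw stress (the sum of squared differences between high- and low-dimensional Euclidean distances) with plain gradient steps in NumPy. It is not the neural network approach proposed in the paper. The function names `pairwise_distances`, `stress`, and `metric_mds`, and the step size `1 / (2n)`, are choices made for this example only.

```python
import numpy as np


def pairwise_distances(X):
    """Dense Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(D_high, Y):
    """Raw stress: squared mismatch between target and embedded distances."""
    D_low = pairwise_distances(Y)
    iu = np.triu_indices_from(D_high, k=1)  # count each pair once
    return float(((D_high[iu] - D_low[iu]) ** 2).sum())


def metric_mds(X, n_components=2, n_iter=200, seed=0):
    """Toy metric MDS by gradient descent on the raw stress.

    Assumption of this sketch: the step size 1 / (2n), which makes each
    step match the classical SMACOF majorization update up to a
    translation, so the stress does not increase between iterations.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D_high = pairwise_distances(X)
    Y = rng.normal(scale=1e-2, size=(n, n_components))
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        D_low = np.sqrt((diff ** 2).sum(axis=-1))
        D_safe = np.maximum(D_low, 1e-12)  # guard against division by zero
        coeff = (D_low - D_high) / D_safe
        np.fill_diagonal(coeff, 0.0)
        # Gradient of the stress with respect to each embedded point.
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Y = Y - grad / (2.0 * n)
    return Y
```

The dense `n x n` distance matrix in this sketch is exactly what makes the naive formulation impractical for large single-cell data sets, which is the scaling problem the abstract refers to.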
Affiliation(s)
- Stefan Canzar
  - Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany
- Van Hoan Do
  - Center for Applied Mathematics and Informatics, Le Quy Don Technical University, Hanoi, Vietnam
- Slobodan Jelić
  - School of Applied Mathematics and Informatics, University of Osijek, Osijek, Croatia
- Sören Laue
  - Department of Informatics, Universität Hamburg, Hamburg, Germany
- Domagoj Matijević
  - School of Applied Mathematics and Informatics, University of Osijek, Osijek, Croatia
- Tomislav Prusina
  - Department of Informatics, Universität Hamburg, Hamburg, Germany

5
Shi M, Tian Y, Luo Y, Elze T, Wang M. RNFLT2Vec: Artifact-corrected representation learning for retinal nerve fiber layer thickness maps. Med Image Anal 2024; 94:103110. [PMID: 38458093 DOI: 10.1016/j.media.2024.103110] [Received: 11/30/2022] [Revised: 02/09/2024] [Accepted: 02/15/2024] [Indexed: 03/10/2024]
Abstract
Optical coherence tomography imaging provides a crucial clinical measurement for diagnosing and monitoring glaucoma through the two-dimensional retinal nerve fiber layer (RNFL) thickness (RNFLT) map. Researchers have been increasingly using neural models to extract meaningful features from the RNFLT map, aiming to identify biomarkers for glaucoma and its progression. However, accurately representing the RNFLT map features relevant to glaucoma is challenging due to significant variations in retinal anatomy among individuals, which confound the pathological thinning of the RNFL. Moreover, the presence of artifacts in the RNFLT map, caused by segmentation errors in the context of degraded image quality and defective imaging procedures, further complicates the task. In this paper, we propose a general framework called RNFLT2Vec for unsupervised learning of vectorized feature representations from RNFLT maps. Our method includes an artifact correction component that learns to rectify RNFLT values at artifact locations, producing a representation reflecting the RNFLT map without artifacts. Additionally, we incorporate two regularization techniques to encourage discriminative representation learning. Firstly, we introduce a contrastive learning-based regularization to capture the similarities and dissimilarities between RNFLT maps. Secondly, we employ a consistency learning-based regularization to align pairwise distances of RNFLT maps with their corresponding thickness distributions. Through extensive experiments on a large-scale real-world dataset, we demonstrate the superiority of RNFLT2Vec in three different clinical tasks: RNFLT pattern discovery, glaucoma detection, and visual field prediction. Our results validate the effectiveness of our framework and its potential to contribute to a better understanding and diagnosis of glaucoma.
Affiliation(s)
- Min Shi
  - Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
- Yu Tian
  - Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
- Yan Luo
  - Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
- Tobias Elze
  - Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
- Mengyu Wang
  - Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA

6
Wu J, Wang L, Xi S, Ma C, Zou F, Fang G, Liu F, Wang X, Qu L. Biological significance of METTL5 in atherosclerosis: comprehensive analysis of single-cell and bulk RNA sequencing data. Aging (Albany NY) 2024; 16:7267-7276. [PMID: 38663914 PMCID: PMC11087127 DOI: 10.18632/aging.205755] [Received: 09/18/2023] [Accepted: 03/27/2024] [Indexed: 05/08/2024]
Abstract
BACKGROUND: N6-methyladenosine (m6A) methylation is involved in the pathogenesis of atherosclerosis (AS). Limited studies have examined the role of the m6A methyltransferase METTL5 in AS pathogenesis.
METHODS: This study subjected the AS dataset to differential analysis and weighted gene co-expression network analysis to identify m6A methylation-associated differentially expressed genes (DEGs). Next, the m6A methylation-related DEGs were subjected to consensus clustering to categorize AS samples into distinct m6A subtypes. Single-cell RNA sequencing (scRNA-seq) analysis was performed to investigate the proportions of each cell type in AS and adjacent healthy tissues and the expression levels of key m6A regulators. The mRNA expression levels of METTL5 in AS and healthy tissues were determined using quantitative real-time polymerase chain reaction (qRT-PCR) analysis.
RESULTS: AS samples were classified into two subtypes based on a five-m6A regulator-based model. scRNA-seq analysis revealed that the proportions of T cells, monocytes, and macrophages in AS tissues were significantly higher than those in healthy tissues. Additionally, the levels of m6A methylation were significantly different between AS and healthy tissues. METTL5 expression was upregulated in macrophages, smooth muscle cells (SMCs), and endothelial cells (ECs). qRT-PCR analysis demonstrated that the METTL5 mRNA level in AS tissues was downregulated when compared with that in healthy tissues.
CONCLUSIONS: METTL5 is a potential diagnostic marker for AS subtypes. Macrophages, SMCs, and ECs, which exhibit METTL5 upregulation, may modulate AS progression by regulating m6A methylation levels.
Affiliation(s)
- Jianjin Wu
  - Department of Vascular and Endovascular Surgery, Second Affiliated Hospital of Naval Medical University, Shanghai, China
- Lei Wang
  - Department of Vascular Surgery, First Affiliated Hospital of Dalian Medical University, Dalian 116011, China
- Shuaishuai Xi
  - Department of Vascular Surgery, Weifang Yidu Central Hospital, Weifang, Shandong, China
- Chao Ma
  - Department of Vascular Surgery, Weifang Yidu Central Hospital, Weifang, Shandong, China
- Fukang Zou
  - Department of Vascular and Endovascular Surgery, Second Affiliated Hospital of Naval Medical University, Shanghai, China
- Guanyu Fang
  - Department of Vascular and Endovascular Surgery, Second Affiliated Hospital of Naval Medical University, Shanghai, China
- Fangbing Liu
  - Department of Vascular and Endovascular Surgery, Second Affiliated Hospital of Naval Medical University, Shanghai, China
- Xiaokai Wang
  - Department of Interventional and Vascular Surgery, The First People’s Hospital of Xuzhou, Xuzhou, Jiangsu, China
- Lefeng Qu
  - Department of Vascular and Endovascular Surgery, Second Affiliated Hospital of Naval Medical University, Shanghai, China

7
Wang Y, Chen X, Zheng Z, Huang L, Xie W, Wang F, Zhang Z, Wong KC. scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics. iScience 2024; 27:109352. [PMID: 38510148 PMCID: PMC10951644 DOI: 10.1016/j.isci.2024.109352] [Received: 09/01/2023] [Revised: 12/29/2023] [Accepted: 02/23/2024] [Indexed: 03/22/2024] Open
Abstract
Gene regulatory networks (GRNs) involve complex and multi-layer regulatory interactions between regulators and their target genes. Precise knowledge of GRNs is important in understanding cellular processes and molecular functions. Recent breakthroughs in single-cell sequencing technology made it possible to infer GRNs at single-cell level. Existing methods, however, are limited by expensive computations, and sometimes simplistic assumptions. To overcome these obstacles, we propose scGREAT, a framework to infer GRN using gene embeddings and transformer from single-cell transcriptomics. scGREAT starts by constructing gene expression and gene biotext dictionaries from scRNA-seq data and gene text information. The representation of TF gene pairs is learned through optimizing embedding space by transformer-based engine. Results illustrated scGREAT outperformed other contemporary methods on benchmarks. Besides, gene representations from scGREAT provide valuable gene regulation insights, and external validation on spatial transcriptomics illuminated the mechanism behind scGREAT annotation. Moreover, scGREAT identified several TF target regulations corroborated in studies.
Affiliation(s)
- Yuchen Wang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Xingjian Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
  - Cutaneous Biology Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Zetian Zheng
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Lei Huang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Weidun Xie
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Fuzhou Wang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Zhaolei Zhang
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
  - Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China

8
Luecken MD, Gigante S, Burkhardt DB, Cannoodt R, Strobl DC, Markov NS, Zappia L, Palla G, Lewis W, Dimitrov D, Vinyard ME, Magruder DS, Andersson A, Dann E, Qin Q, Otto DJ, Klein M, Botvinnik OB, Deconinck L, Waldrant K, Bloom JM, Pisco AO, Saez-Rodriguez J, Wulsin D, Pinello L, Saeys Y, Theis FJ, Krishnaswamy S. Defining and benchmarking open problems in single-cell analysis. Research Square 2024:rs.3.rs-4181617. [PMID: 38645152 PMCID: PMC11030530 DOI: 10.21203/rs.3.rs-4181617/v1] [Indexed: 04/23/2024]
Abstract
With the growing number of single-cell analysis tools, benchmarks are increasingly important to guide analysis and method development. However, a lack of standardisation and extensibility in current benchmarks limits their usability, longevity, and relevance to the community. We present Open Problems, a living, extensible, community-guided benchmarking platform including 10 current single-cell tasks that we envision will raise standards for the selection, evaluation, and development of methods in single-cell analysis.
Affiliation(s)
- Malte D Luecken
  - Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany
  - Institute of Lung Health & Immunity, Helmholtz Munich; Member of the German Center for Lung Research (DZL), Munich, Germany
- Robrecht Cannoodt
  - Data Intuitive, Lebbeke, Belgium
  - Data Mining and Modelling for Biomedicine group, VIB Center for Inflammation Research, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science, and Statistics, Ghent University, Ghent, Belgium
- Daniel C Strobl
  - Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany
  - Institute of Clinical Chemistry and Pathobiochemistry, School of Medicine, Technical University of Munich, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany
- Nikolay S Markov
  - Division of Pulmonary and Critical Care Medicine, Feinberg School of Medicine, Northwestern University
- Luke Zappia
  - Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany
  - Department of Mathematics, School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
- Giovanni Palla
  - Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany
- Wesley Lewis
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
- Daniel Dimitrov
  - Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Heidelberg, Germany
- Michael E Vinyard
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- D S Magruder
  - Department of Computer Science, Yale University, New Haven CT, USA
- Alma Andersson
  - Genentech Inc
  - Royal Institute of Technology (KTH), Gene Technology
  - Science for Life Laboratory (SciLifeLab)
- Emma Dann
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Qian Qin
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Dominik J Otto
  - Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle WA
  - Computational Biology Program, Public Health Sciences Division, Seattle WA
  - Translational Data Science IRC, Fred Hutchinson Cancer Center, Seattle WA
- Olga Borisovna Botvinnik
  - Data Sciences Platform, Chan Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158
  - Bridge Bio Pharma, 3160 Porter Drive, Suite 250, Palo Alto, CA, 94304
- Louise Deconinck
  - Data Mining and Modelling for Biomedicine group, VIB Center for Inflammation Research, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science, and Statistics, Ghent University, Ghent, Belgium
- Angela Oliveira Pisco
  - Data Sciences Platform, Chan Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158
  - Insitro, South San Francisco
- Julio Saez-Rodriguez
  - Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Heidelberg, Germany
- Luca Pinello
  - Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Yvan Saeys
  - Data Mining and Modelling for Biomedicine group, VIB Center for Inflammation Research, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science, and Statistics, Ghent University, Ghent, Belgium
  - VIB Center for AI & Computational Biology (VIB.AI), Gent, Belgium
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany
  - Department of Mathematics, School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
  - Cellular Genetics Programme, Wellcome Sanger Institute, Hinxton, UK (associated faculty)
- Smita Krishnaswamy
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
  - Department of Computer Science, Yale University, New Haven CT, USA
  - Department of Genetics, Yale University, New Haven CT, USA

9
Yuan CU, Quah FX, Hemberg M. Single-cell and spatial transcriptomics: Bridging current technologies with long-read sequencing. Mol Aspects Med 2024; 96:101255. [PMID: 38368637 DOI: 10.1016/j.mam.2024.101255] [Received: 11/17/2023] [Revised: 01/30/2024] [Accepted: 02/07/2024] [Indexed: 02/20/2024]
Abstract
Single-cell technologies have transformed biomedical research over the last decade, opening up new possibilities for understanding cellular heterogeneity, both at the genomic and transcriptomic level. In addition, more recent developments of spatial transcriptomics technologies have made it possible to profile cells in their tissue context. In parallel, there have been substantial advances in sequencing technologies, and the third generation of methods are able to produce reads that are tens of kilobases long, with error rates matching the second generation short reads. Long reads technologies make it possible to better map large genome rearrangements and quantify isoform specific abundances. This further improves our ability to characterize functionally relevant heterogeneity. Here, we show how researchers have begun to combine single-cell, spatial transcriptomics, and long-read technologies, and how this is resulting in powerful new approaches to profiling both the genome and the transcriptome. We discuss the achievements so far, and we highlight remaining challenges and opportunities.
Affiliation(s)
- Chengwei Ulrika Yuan
  - Department of Biochemistry, University of Cambridge, Cambridge, UK; Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Fu Xiang Quah
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- Martin Hemberg
  - Gene Lay Institute, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA

10
Weine E, Carbonetto P, Stephens M. Accelerated dimensionality reduction of single-cell RNA sequencing data with fastglmpca. bioRxiv 2024:2024.03.23.586420. [PMID: 38585920 PMCID: PMC10996495 DOI: 10.1101/2024.03.23.586420] [Indexed: 04/09/2024]
Abstract
Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (RNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better quality fits, and in less time, than existing algorithms. APR is also memory-efficient, and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large single-cell RNA-seq data sets. We illustrate the benefits of this approach in two published single-cell RNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca.
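The model behind Poisson GLM-PCA treats each count as Poisson-distributed with a log-mean given by a low-rank product of row and column factors. The sketch below is a toy NumPy illustration of the alternating idea, not the fastglmpca implementation: with the column factors held fixed, each row's factors are improved by a small Poisson regression, and then the roles are swapped. The Newton steps with backtracking, the omission of intercepts and size factors, and all function names are assumptions of this example.

```python
import numpy as np


def poisson_loglik(Y, L, F):
    """Poisson log-likelihood with log-means L @ F.T, dropping the log(y!) constant."""
    H = L @ F.T
    return float((Y * H - np.exp(H)).sum())


def _row_objective(y, F, l):
    """Log-likelihood contribution of one row, as a function of its factors l."""
    eta = F @ l
    return float(y @ eta - np.exp(eta).sum())


def update_rows(Y, L, F, n_newton=3):
    """Improve each row of L by a few guarded Newton steps of Poisson regression on F."""
    K = F.shape[1]
    L = L.copy()
    for i in range(Y.shape[0]):
        for _ in range(n_newton):
            mu = np.exp(F @ L[i])
            grad = F.T @ (Y[i] - mu)
            # Negative Hessian of the row objective, with a tiny ridge for stability.
            hess = (F * mu[:, None]).T @ F + 1e-8 * np.eye(K)
            step = np.linalg.solve(hess, grad)
            current = _row_objective(Y[i], F, L[i])
            t = 1.0
            while t > 1e-8:  # backtracking: accept only non-worsening steps
                candidate = L[i] + t * step
                if _row_objective(Y[i], F, candidate) >= current:
                    L[i] = candidate
                    break
                t *= 0.5
    return L


def fit_poisson_glmpca(Y, K=2, n_iter=10, seed=0):
    """Alternate row-wise and column-wise updates from a small random start."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    L = rng.normal(scale=0.1, size=(n, K))
    F = rng.normal(scale=0.1, size=(m, K))
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_iter):
            L = update_rows(Y, L, F)    # rows of Y regressed on F
            F = update_rows(Y.T, F, L)  # columns of Y regressed on L
    return L, F
```

Because every accepted step is checked against the current row objective, the overall log-likelihood in this sketch cannot decrease from one sweep to the next. The row updates are also independent of each other, which is what makes this kind of scheme amenable to parallel execution.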
Affiliation(s)
- Eric Weine
  - Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Data Science, Dana Farber Cancer Institute, Boston, MA 02215, USA
- Peter Carbonetto
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
  - Department of Statistics, University of Chicago, Chicago, IL 60637, USA

11
Kang Y, Zhang H, Guan J. scINRB: single-cell gene expression imputation with network regularization and bulk RNA-seq data. Brief Bioinform 2024; 25:bbae148. [PMID: 38600665 PMCID: PMC11006796 DOI: 10.1093/bib/bbae148] [Received: 01/09/2024] [Revised: 02/26/2024] [Accepted: 03/18/2024] [Indexed: 04/12/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) facilitates the study of cell type heterogeneity and the construction of cell atlas. However, due to its limitations, many genes may be detected to have zero expressions, i.e. dropout events, leading to bias in downstream analyses and hindering the identification and characterization of cell types and cell functions. Although many imputation methods have been developed, their performances are generally lower than expected across different kinds and dimensions of data and application scenarios. Therefore, developing an accurate and robust single-cell gene expression data imputation method is still essential. Considering to maintain the original cell-cell and gene-gene correlations and leverage bulk RNA sequencing (bulk RNA-seq) data information, we propose scINRB, a single-cell gene expression imputation method with network regularization and bulk RNA-seq data. scINRB adopts network-regularized non-negative matrix factorization to ensure that the imputed data maintains the cell-cell and gene-gene similarities and also approaches the gene average expression calculated from bulk RNA-seq data. To evaluate the performance, we test scINRB on simulated and experimental datasets and compare it with other commonly used imputation methods. The results show that scINRB recovers gene expression accurately even in the case of high dropout rates and dimensions, preserves cell-cell and gene-gene similarities and improves various downstream analyses including visualization, clustering and trajectory inference.
Affiliation(s)
- Yue Kang
  - Department of Automation, Xiamen University, Xiamen, Fujian, China
- Hongyu Zhang
  - Department of Automation, Xiamen University, Xiamen, Fujian, China
- Jinting Guan
  - Department of Automation, Xiamen University, Xiamen, Fujian, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China

12
Li T, Qian K, Wang X, Li WV, Li H. scBiG for representation learning of single-cell gene expression data based on bipartite graph embedding. NAR Genom Bioinform 2024; 6:lqae004. [PMID: 38288376 PMCID: PMC10823585 DOI: 10.1093/nargab/lqae004] [Received: 10/19/2023] [Revised: 12/19/2023] [Accepted: 01/09/2024] [Indexed: 01/31/2024] Open
Abstract
Analyzing single-cell RNA sequencing (scRNA-seq) data remains a challenge due to its high dimensionality, sparsity and technical noise. Recognizing the benefits of dimensionality reduction in simplifying complexity and enhancing the signal-to-noise ratio, we introduce scBiG, a novel graph node embedding method designed for representation learning in scRNA-seq data. scBiG establishes a bipartite graph connecting cells and expressed genes, and then constructs a multilayer graph convolutional network to learn cell and gene embeddings. Through a series of extensive experiments, we demonstrate that scBiG surpasses commonly used dimensionality reduction techniques in various analytical tasks. Downstream tasks encompass unsupervised cell clustering, cell trajectory inference, gene expression reconstruction and gene co-expression analysis. Additionally, scBiG exhibits notable computational efficiency and scalability. In summary, scBiG offers a useful graph neural network framework for representation learning in scRNA-seq data, empowering a diverse array of downstream analyses.
Affiliation(s)
- Ting Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| | - Kun Qian
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| | - Xiang Wang
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| | - Wei Vivian Li
- Department of Statistics, University of California, Riverside, Riverside, CA 92507, USA
| | - Hongwei Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
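The bipartite-graph idea in the scBiG abstract above can be illustrated with a toy message-passing step. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `bipartite_propagate`, the symmetric degree normalization, and the absence of learned weights are all choices made here for illustration. An edge connects a cell to a gene whenever the count is positive, and each node's new embedding is a normalized sum of its neighbors' embeddings.

```python
from math import sqrt

def bipartite_propagate(counts, cell_emb, gene_emb):
    """One normalized propagation step on a cell-gene bipartite graph.

    counts   : list of rows (cells x genes); an edge exists where count > 0
    cell_emb : list of per-cell embedding vectors
    gene_emb : list of per-gene embedding vectors
    Hypothetical helper for illustration only.
    """
    n_cells, n_genes = len(counts), len(counts[0])
    dim = len(cell_emb[0])
    # Node degrees in the bipartite graph (number of expressed genes / expressing cells).
    cell_deg = [sum(1 for v in row if v > 0) for row in counts]
    gene_deg = [sum(1 for i in range(n_cells) if counts[i][j] > 0)
                for j in range(n_genes)]
    new_cell = [[0.0] * dim for _ in range(n_cells)]
    new_gene = [[0.0] * dim for _ in range(n_genes)]
    for i in range(n_cells):
        for j in range(n_genes):
            if counts[i][j] > 0:
                # Symmetric normalization 1 / sqrt(deg(cell) * deg(gene)).
                w = 1.0 / sqrt(cell_deg[i] * gene_deg[j])
                for d in range(dim):
                    new_cell[i][d] += w * gene_emb[j][d]
                    new_gene[j][d] += w * cell_emb[i][d]
    return new_cell, new_gene
```

Stacking several such steps and training the embeddings against a reconstruction objective would be closer to a graph convolutional network; the sketch only shows how information flows between cells and the genes they express.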
|
13
|
Xia L, Lee C, Li JJ. Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat Commun 2024; 15:1753. [PMID: 38409103 PMCID: PMC10897166 DOI: 10.1038/s41467-024-45891-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 02/06/2024] [Indexed: 02/28/2024]
Abstract
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, we present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell's 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. We show the effectiveness of scDEED on multiple datasets for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
Affiliation(s)
- Lucy Xia
- Department of ISOM, School of Business and Management, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Christy Lee
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
- Radcliffe Institute of Advanced Study, Harvard University, Cambridge, MA, USA
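The abstract above describes scoring each cell by how similar its neighbors in the 2D embedding are to its neighbors before embedding. A simplified proxy for that idea is the fraction of shared k-nearest neighbors. This sketch is an assumption made for illustration: the published scDEED score and its thresholds may be defined differently, and the helper names `knn` and `neighbor_agreement` are hypothetical.

```python
from math import dist

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (Euclidean distance)."""
    others = (j for j in range(len(points)) if j != i)
    return set(sorted(others, key=lambda j: dist(points[i], points[j]))[:k])

def neighbor_agreement(pre_embedding, embedding_2d, k):
    """Per-cell fraction of k nearest neighbors preserved by the 2D embedding.

    A value near 0 marks a cell whose 2D neighborhood disagrees with its
    pre-embedding neighborhood (a candidate 'dubious' embedding); a value
    near 1 marks a well-preserved neighborhood.
    """
    scores = []
    for i in range(len(pre_embedding)):
        before = knn(pre_embedding, i, k)
        after = knn(embedding_2d, i, k)
        scores.append(len(before & after) / k)
    return scores
```

In the same spirit as the hyperparameter guidance mentioned in the abstract, one could compute these scores for several perplexity or `n_neighbors` settings and prefer the setting with the fewest low-scoring cells.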
|
14
|
Chen Y, Zheng R, Liu J, Li M. scMLC: an accurate and robust multiplex community detection method for single-cell multi-omics data. Brief Bioinform 2024; 25:bbae101. [PMID: 38493339 PMCID: PMC10944569 DOI: 10.1093/bib/bbae101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Revised: 01/03/2024] [Accepted: 02/15/2024] [Indexed: 03/18/2024]
Abstract
Clustering cells based on single-cell multi-modal sequencing technologies provides an unprecedented opportunity to create high-resolution cell atlases, reveal critical cellular states and study health and disease. However, effectively integrating different sequencing data for cell clustering remains a challenging task. Motivated by the successful application of Louvain in scRNA-seq data, we propose a single-cell multi-modal Louvain clustering framework, called scMLC, to tackle this problem. scMLC builds multiplex single- and cross-modal cell-to-cell networks to capture modal-specific and consistent information between modalities and then adopts a robust multiplex community detection method to obtain reliable cell clusters. In comparison with 15 state-of-the-art clustering methods on seven real datasets simultaneously measuring gene expression and chromatin accessibility, scMLC achieves better accuracy and stability in most datasets. Synthetic results also indicate that the cell-network-based integration strategy of multi-omics data is superior to other strategies in terms of generalization. Moreover, scMLC is flexible and can be extended to single-cell sequencing data with more than two modalities.
Affiliation(s)
- Yuxuan Chen
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Jin Liu
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
15
|
Atitey K, Motsinger-Reif AA, Anchang B. Model-based evaluation of spatiotemporal data reduction methods with unknown ground truth through optimal visualization and interpretability metrics. Brief Bioinform 2023; 25:bbad455. [PMID: 38113074 PMCID: PMC10729792 DOI: 10.1093/bib/bbad455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 11/06/2023] [Accepted: 11/20/2023] [Indexed: 12/21/2023]
Abstract
Optimizing and benchmarking data reduction methods for dynamic or spatial visualization and interpretation (DSVI) face challenges due to many factors, including data complexity, lack of ground truth, time-dependent metrics, dimensionality bias and different visual mappings of the same data. Current studies often focus on independent static visualization or interpretability metrics that require ground truth. To overcome this limitation, we propose the MIBCOVIS framework, a comprehensive and interpretable benchmarking and computational approach. MIBCOVIS enhances the visualization and interpretability of high-dimensional data without relying on ground truth by integrating five robust metrics, including a novel time-ordered Markov-based structural metric, into a semi-supervised hierarchical Bayesian model. The framework assesses method accuracy and considers interaction effects among metric features. We apply MIBCOVIS using linear and nonlinear dimensionality reduction methods to evaluate optimal DSVI for four distinct dynamic and spatial biological processes captured by three single-cell data modalities: CyTOF, scRNA-seq and CODEX. These data vary in complexity based on feature dimensionality, unknown cell types and dynamic or spatial differences. Unlike traditional single-summary score approaches, MIBCOVIS compares accuracy distributions across methods. Our findings underscore the joint evaluation of visualization and interpretability, rather than relying on separate metrics. We reveal that prioritizing average performance can obscure method feature performance. Additionally, we explore the impact of data complexity on visualization and interpretability. Specifically, we provide optimal parameters and features and recommend methods, like the optimized variational contractive autoencoder, for targeted DSVI for various data complexities. MIBCOVIS shows promise for evaluating dynamic single-cell atlases and spatiotemporal data reduction models.
Affiliation(s)
- Komlan Atitey
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
| | - Alison A Motsinger-Reif
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
| | - Benedict Anchang
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
|
16
|
Hassan AZ, Ward HN, Rahman M, Billmann M, Lee Y, Myers CL. Dimensionality reduction methods for extracting functional networks from large-scale CRISPR screens. Mol Syst Biol 2023; 19:e11657. [PMID: 37750448 PMCID: PMC10632734 DOI: 10.15252/msb.202311657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2023] [Revised: 08/28/2023] [Accepted: 09/05/2023] [Indexed: 09/27/2023]
Abstract
CRISPR-Cas9 screens facilitate the discovery of gene functional relationships and phenotype-specific dependencies. The Cancer Dependency Map (DepMap) is the largest compendium of whole-genome CRISPR screens aimed at identifying cancer-specific genetic dependencies across human cell lines. A mitochondria-associated bias has been previously reported to mask signals for genes involved in other functions, and thus, methods for normalizing this dominant signal to improve co-essentiality networks are of interest. In this study, we explore three unsupervised dimensionality reduction methods, namely autoencoders, robust principal component analysis (PCA), and classical PCA, for normalizing the DepMap to improve functional networks extracted from these data. We propose a novel "onion" normalization technique to combine several normalized data layers into a single network. Benchmarking analyses reveal that robust PCA combined with onion normalization outperforms existing methods for normalizing the DepMap. Our work demonstrates the value of removing low-dimensional signals from the DepMap before constructing functional gene networks and provides generalizable dimensionality reduction-based normalization tools.
Affiliation(s)
- Arshia Zernab Hassan
- Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
| | - Henry N Ward
- Bioinformatics and Computational Biology Graduate Program, University of Minnesota – Twin Cities, Minneapolis, MN, USA
| | - Mahfuzur Rahman
- Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
| | - Maximilian Billmann
- Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
- Institute of Human Genetics, University of Bonn, School of Medicine and University Hospital Bonn, Bonn, Germany
| | - Yoonkyu Lee
- Bioinformatics and Computational Biology Graduate Program, University of Minnesota – Twin Cities, Minneapolis, MN, USA
| | - Chad L Myers
- Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
- Bioinformatics and Computational Biology Graduate Program, University of Minnesota – Twin Cities, Minneapolis, MN, USA
|
17
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. Genome Biol 2023; 24:236. [PMID: 37858253 PMCID: PMC10588049 DOI: 10.1186/s13059-023-03067-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 09/20/2023] [Indexed: 10/21/2023]
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Research Computing Center, University of Chicago, Chicago, IL, USA
| | - Kaixuan Luo
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Abhishek Sarkar
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Vesalius Therapeutics, Cambridge, MA, USA
| | - Anthony Hung
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Karl Tayeb
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Sebastian Pott
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Department of Statistics, University of Chicago, Chicago, IL, USA
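The GoM DE abstract above extends differential expression to cells with partial membership in several groups. The effect of partial membership can be seen in a toy calculation where each cell contributes to a group's mean in proportion to its grade of membership. This is a simplified sketch, not the GoM DE model itself (the abstract does not give the model's details); the function names and the pseudocount are assumptions made here.

```python
from math import log2

def weighted_mean(values, weights):
    """Mean of values with non-negative weights (weights must not all be zero)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def gom_log_fold_change(expr, membership, pseudocount=1.0):
    """Log2 fold change of one gene for one topic under partial membership.

    expr       : per-cell expression of a single gene
    membership : per-cell grade of membership in the topic, each in [0, 1]
    Cells are weighted by their membership 'inside' the topic and by
    (1 - membership) 'outside' it. With memberships of exactly 0 or 1 this
    reduces to an ordinary two-group comparison.
    """
    inside = weighted_mean(expr, membership)
    outside = weighted_mean(expr, [1.0 - m for m in membership])
    return log2((inside + pseudocount) / (outside + pseudocount))
```

For example, a gene expressed at 7 in two cells that belong fully to a topic and at 1 in two cells that do not gives a log2 fold change of 2 with the default pseudocount, while the same gene with all memberships equal to 0.5 gives 0.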
|
18
|
Du J, Gu XR, Yu XX, Cao YJ, Hou J. Essential procedures of single-cell RNA sequencing in multiple myeloma and its translational value. Blood Science 2023; 5:221-236. [PMID: 37941914 PMCID: PMC10629747 DOI: 10.1097/bs9.0000000000000172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 09/18/2023] [Indexed: 11/10/2023]
Abstract
Multiple myeloma (MM) is a malignant neoplasm characterized by clonal proliferation of abnormal plasma cells. In many countries, it ranks as the second most prevalent malignant neoplasm of the hematopoietic system. Although treatment methods for MM have been continuously improved and the survival of patients has been dramatically prolonged, MM remains an incurable disease with a high probability of recurrence. As such, there are still many challenges to be addressed. One promising approach is single-cell RNA sequencing (scRNA-seq), which can elucidate the transcriptome heterogeneity of individual cells and reveal previously unknown cell types or states in complex tissues. In this review, we outlined the experimental workflow of scRNA-seq in MM and listed some commonly used scRNA-seq platforms and analytical tools. In addition, with the advent of scRNA-seq, many studies have made new progress in understanding the key molecular mechanisms during MM clonal evolution, cell interactions and molecular regulation in the microenvironment, and drug resistance mechanisms in targeted therapy. We summarized the main findings and sequencing platforms for applying scRNA-seq to MM research and proposed broad directions for targeted therapies based on these findings.
Affiliation(s)
- Jun Du
- Department of Hematology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China
| | - Xiao-Ran Gu
- School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
| | - Xiao-Xiao Yu
- School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
| | - Yang-Jia Cao
- Department of Hematology, First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, Shanxi 710000, China
| | - Jian Hou
- Department of Hematology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China
|
19
|
Xia L, Lee C, Li JJ. scDEED: a statistical method for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. bioRxiv 2023:2023.04.21.537839. [PMID: 37163087 PMCID: PMC10168265 DOI: 10.1101/2023.04.21.537839] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/11/2023]
Abstract
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-SNE and UMAP are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embedding might not reliably inform the similarities among cell clusters. Motivated by this challenge, we developed a statistical method, scDEED, for detecting dubious cell embeddings output by any 2D-embedding method. By calculating a reliability score for every cell embedding, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. Applied to multiple scRNA-seq datasets, scDEED demonstrates its effectiveness for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
|
20
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. bioRxiv 2023:2023.03.03.531029. [PMID: 36945441 PMCID: PMC10028846 DOI: 10.1101/2023.03.03.531029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Research Computing Center, University of Chicago, Chicago, IL, USA
| | - Kaixuan Luo
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Abhishek Sarkar
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Vesalius Therapeutics, Cambridge, MA, USA
| | - Anthony Hung
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Karl Tayeb
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Sebastian Pott
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Department of Statistics, University of Chicago, Chicago, IL, USA
|
21
|
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023]
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations. Single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. At the same time, single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, making analysis with traditional computational approaches difficult or infeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or whether the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources for fast-tracking and advancing single-cell research.
Affiliation(s)
- Nafiseh Erfanian
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
| | - A Ali Heydari
- Department of Applied Mathematics, University of California, Merced, CA, USA; Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Adib Miraki Feriz
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
| | - Pablo Iañez
- Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
| | - Afshin Derakhshani
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
| | | | - Mohsen Farahpour
- Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
| | - Seyyed Mohammad Razavi
- Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
| | - Saeed Nasseri
- Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
| | - Hossein Safarpour
- Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
| | - Amirhossein Sahebkar
- Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran
|
22
|
Gunawan I, Vafaee F, Meijering E, Lock JG. An introduction to representation learning for single-cell data analysis. Cell Reports Methods 2023; 3:100547. [PMID: 37671013 PMCID: PMC10475795 DOI: 10.1016/j.crmeth.2023.100547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2023]
Abstract
Single-cell-resolved systems biology methods, including omics- and imaging-based measurement modalities, generate a wealth of high-dimensional data characterizing the heterogeneity of cell populations. Representation learning methods are routinely used to analyze these complex, high-dimensional data by projecting them into lower-dimensional embeddings. This facilitates the interpretation and interrogation of the structures, dynamics, and regulation of cell heterogeneity. Reflecting their central role in analyzing diverse single-cell data types, a myriad of representation learning methods exist, with new approaches continually emerging. Here, we contrast general features of representation learning methods spanning statistical, manifold learning, and neural network approaches. We consider key steps involved in representation learning with single-cell data, including data pre-processing, hyperparameter optimization, downstream analysis, and biological validation. Interdependencies and contingencies linking these steps are also highlighted. This overview is intended to guide researchers in the selection, application, and optimization of representation learning strategies for current and future single-cell research applications.
Affiliation(s)
- Ihuan Gunawan
- School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
- School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
| | - Fatemeh Vafaee
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, University of New South Wales, Sydney, NSW, Australia
- UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
| | - Erik Meijering
- School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
| | - John George Lock
- School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
- UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
- Ingham Institute for Applied Medical Research, Liverpool, NSW, Australia
|
23
|
Kana O, Nault R, Filipovic D, Marri D, Zacharewski T, Bhattacharya S. Generative modeling of single-cell gene expression for dose-dependent chemical perturbations. Patterns (New York, N.Y.) 2023; 4:100817. [PMID: 37602218 PMCID: PMC10436058 DOI: 10.1016/j.patter.2023.100817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 12/07/2022] [Accepted: 07/14/2023] [Indexed: 08/22/2023]
Abstract
Single-cell sequencing reveals the heterogeneity of cellular response to chemical perturbations. However, testing all relevant combinations of cell types, chemicals, and doses is a daunting task. A deep generative learning formalism called variational autoencoders (VAEs) has been effective in predicting single-cell gene expression perturbations for single doses. Here, we introduce single-cell variational inference of dose-response (scVIDR), a VAE-based model that predicts both single-dose and multiple-dose cellular responses better than existing models. We show that scVIDR can predict dose-dependent gene expression across mouse hepatocytes, human blood cells, and cancer cell lines. We biologically interpret the latent space of scVIDR using a regression model and use scVIDR to order individual cells based on their sensitivity to chemical perturbation by assigning each cell a "pseudo-dose" value. We envision that scVIDR can help reduce the need for repeated animal testing across tissues, chemicals, and doses.
Affiliation(s)
- Omar Kana
- Department of Pharmacology and Toxicology, Michigan State University, East Lansing, MI 48824, USA
- Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Rance Nault
- Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| | - David Filipovic
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Biomedical Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Daniel Marri
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Biomedical Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Tim Zacharewski
- Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| | - Sudin Bhattacharya
- Department of Pharmacology and Toxicology, Michigan State University, East Lansing, MI 48824, USA
- Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Biomedical Engineering, Michigan State University, East Lansing, MI 48824, USA
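The scVIDR abstract above says only that each cell is assigned a "pseudo-dose" to order cells by sensitivity; it does not say how the value is computed. One plausible reading, shown here purely as an assumption, is to project each cell's latent vector onto the axis running from the mean control cell to the mean treated cell. The function name `pseudo_dose` and the scaling (0 at the control mean, 1 at the treated mean) are hypothetical choices for this sketch.

```python
def pseudo_dose(latent, control_mean, treated_mean):
    """Scalar position of each cell along the control-to-treated latent axis.

    latent       : list of per-cell latent vectors (e.g. from a trained VAE encoder)
    control_mean : mean latent vector of untreated cells
    treated_mean : mean latent vector of treated cells
    Assumed construction for illustration; not taken from the abstract.
    """
    delta = [t - c for t, c in zip(treated_mean, control_mean)]
    norm_sq = sum(d * d for d in delta)
    # Orthogonal projection of (cell - control_mean) onto delta, in units of |delta|.
    return [
        sum((z - c) * d for z, c, d in zip(cell, control_mean, delta)) / norm_sq
        for cell in latent
    ]
```

Sorting cells by this value gives an ordering from least to most shifted along the perturbation direction, which is the kind of sensitivity ranking the abstract describes.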
|
24
|
Morales-Hernandez AG, Martinez-Aguilar V, Chavez-Gonzalez TM, Mendez-Avila JC, Frias-Becerril JV, Morales-Hernandez LA, Cruz-Albarran IA. Short-Term Thermal Effect of Continuous Ultrasound from 3 MHz to 1 and 0.5 W/cm² Applied to Gastrocnemius Muscle. Diagnostics (Basel) 2023; 13:2644. [PMID: 37627903 PMCID: PMC10453025 DOI: 10.3390/diagnostics13162644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 08/05/2023] [Indexed: 08/27/2023]
Abstract
Continuous ultrasound is recognized for its thermal effect and use in the tissue repair process. However, there is controversy about its dosage and efficacy. This study used infrared thermography, a non-invasive technique, to measure the short-term thermal effect of 3 MHz continuous ultrasound vs. a placebo, referencing the intensity applied. It was a single-blind, randomized clinical trial of 60 healthy volunteers (19-24 years old) divided into three equal groups. Group 1: 1 W/cm² for 5 min; Group 2: 0.5 W/cm² for 10 min; and Group 3: the placebo for 5 min. The temperature was recorded through five thermographic images per patient: pre- and post-application, and 5, 10, and 15 min later. After statistical analysis, a more significant decrease in temperature (p<0.05) was observed in the placebo group compared with the remaining groups after the application of continuous ultrasound. Group 1 generated the highest significant thermal effect (p<0.001), with an increase of 3.05 °C at 15 min, compared with the other two groups. It is concluded that to generate a thermal effect in the muscle, intensities of ≥1 W/cm² are required, since the dosage maintained a temperature increase for more than 5 min.
Affiliation(s)
- Arely G. Morales-Hernandez
- Faculty of Nursing, Autonomous University of Queretaro, Queretaro 76010, Mexico
- Education, Movement and Health, Faculty of Nursing, Autonomous University of Queretaro, Queretaro 76010, Mexico
| | - Violeta Martinez-Aguilar
- Faculty of Nursing, Autonomous University of Queretaro, Campus Corregidora, Queretaro 76912, Mexico
| | | | - Julio C. Mendez-Avila
- Faculty of Nursing, Autonomous University of Queretaro, Queretaro 76010, Mexico
- Education, Movement and Health, Faculty of Nursing, Autonomous University of Queretaro, Queretaro 76010, Mexico
- Luis A. Morales-Hernandez
- Laboratory of Artificial Vision and Thermography/Mechatronics, Faculty of Engineering, Autonomous University of Queretaro, Campus San Juan del Rio, San Juan del Río 76807, Mexico
- Irving A. Cruz-Albarran
- Faculty of Nursing, Autonomous University of Queretaro, Queretaro 76010, Mexico
- Laboratory of Artificial Vision and Thermography/Mechatronics, Faculty of Engineering, Autonomous University of Queretaro, Campus San Juan del Rio, San Juan del Río 76807, Mexico
- Artificial Intelligence Systems Applied to Biomedical and Mechanical Models, Faculty of Engineering, Autonomous University of Queretaro, Campus San Juan del Rio, San Juan del Rio 76807, Mexico
|
25
|
Raimundo F, Prompsy P, Vert JP, Vallot C. A benchmark of computational pipelines for single-cell histone modification data. Genome Biol 2023; 24:143. [PMID: 37340307 PMCID: PMC10280832 DOI: 10.1186/s13059-023-02981-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/21/2022] [Accepted: 06/07/2023] [Indexed: 06/22/2023] Open
Abstract
BACKGROUND Single-cell histone post-translational modification (scHPTM) assays such as scCUT&Tag or scChIP-seq allow single-cell mapping of diverse epigenomic landscapes within complex tissues and are likely to unlock our understanding of various mechanisms involved in development or diseases. Running scHPTM experiments and analyzing the data produced remains challenging since few consensus guidelines currently exist regarding good practices for experimental design and data analysis pipelines. RESULTS We perform a computational benchmark to assess the impact of experimental parameters and data analysis pipelines on the ability of the cell representation to recapitulate known biological similarities. We run more than ten thousand experiments to systematically study the impact of coverage and number of cells, of the count matrix construction method, of feature selection and normalization, and of the dimension reduction algorithm used. This allows us to identify key experimental parameters and computational choices to obtain a good representation of single-cell HPTM data. We show in particular that the count matrix construction step has a strong influence on the quality of the representation and that using fixed-size bin counts outperforms annotation-based binning. Dimension reduction methods based on latent semantic indexing outperform others, and feature selection is detrimental, while keeping only high-quality cells has little influence on the final representation as long as enough cells are analyzed. CONCLUSIONS This benchmark provides a comprehensive study on how experimental parameters and computational choices affect the representation of single-cell HPTM data. We propose a series of recommendations regarding matrix construction, feature and cell selection, and dimensionality reduction algorithms.
Affiliation(s)
- Félix Raimundo
- Google Research, Brain team, 75009, Paris, France
- Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France
- Pacôme Prompsy
- Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France
- CNRS UMR3244, Institut Curie, PSL Research University, 75005, Paris, France
- Jean-Philippe Vert
- Google Research, Brain team, 75009, Paris, France.
- Owkin, Inc, NY, New York, USA.
- Céline Vallot
- Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France.
- CNRS UMR3244, Institut Curie, PSL Research University, 75005, Paris, France.
|
26
|
Li K, Sun YH, Ouyang Z, Negi S, Gao Z, Zhu J, Wang W, Chen Y, Piya S, Hu W, Zavodszky MI, Yalamanchili H, Cao S, Gehrke A, Sheehan M, Huh D, Casey F, Zhang X, Zhang B. scRNASequest: an ecosystem of scRNA-seq analysis, visualization, and publishing. BMC Genomics 2023; 24:228. [PMID: 37131143 PMCID: PMC10155351 DOI: 10.1186/s12864-023-09332-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Accepted: 04/25/2023] [Indexed: 05/04/2023] Open
Abstract
BACKGROUND Single-cell RNA sequencing is a state-of-the-art technology to understand gene expression in complex tissues. With the growing amount of data being generated, the standardization and automation of data analysis are critical to generating hypotheses and discovering biological insights. RESULTS Here, we present scRNASequest, a semi-automated single-cell RNA-seq (scRNA-seq) data analysis workflow which allows (1) preprocessing from raw UMI count data, (2) harmonization by one or multiple methods, (3) reference-dataset-based cell type label transfer and embedding projection, (4) multi-sample, multi-condition single-cell level differential gene expression analysis, and (5) seamless integration with cellxgene VIP for visualization and with CellDepot for data hosting and sharing by generating compatible h5ad files. CONCLUSIONS We developed scRNASequest, an end-to-end pipeline for single-cell RNA-seq data analysis, visualization, and publishing. The source code under MIT open-source license is provided at https://github.com/interactivereport/scRNASequest . We also prepared a bookdown tutorial for the installation and detailed usage of the pipeline: https://interactivereport.github.io/scRNAsequest/tutorial/docs/ . Users have the option to run it on a local computer with a Linux/Unix system including MacOS, or interact with SGE/Slurm schedulers on high-performance computing (HPC) clusters.
Affiliation(s)
- Kejie Li
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Yu H Sun
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Soumya Negi
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Zhen Gao
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Jing Zhu
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Wanli Wang
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Yirui Chen
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Sarbottam Piya
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Wenxing Hu
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Maria I Zavodszky
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Hima Yalamanchili
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Shaolong Cao
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Andrew Gehrke
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Mark Sheehan
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Dann Huh
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Fergal Casey
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA
- Xinmin Zhang
- Data Science, BioInfoRx Inc., Madison, WI, 53719, USA
- Baohong Zhang
- Research Data Sciences, Translational Biology, Biogen Inc., Cambridge, MA, 02142, USA.
|
27
|
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (New York, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Affiliation(s)
- Shixiong Zhang
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin 130012, China
- Jiecong Lin
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Qiuzhen Lin
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
|
28
|
Zhang Z, Wei X. Artificial intelligence-assisted selection and efficacy prediction of antineoplastic strategies for precision cancer therapy. Semin Cancer Biol 2023; 90:57-72. [PMID: 36796530 DOI: 10.1016/j.semcancer.2023.02.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 10/19/2022] [Revised: 01/12/2023] [Accepted: 02/13/2023] [Indexed: 02/16/2023]
Abstract
The rapid development of artificial intelligence (AI) technologies in the context of the vast amount of collectable data obtained from high-throughput sequencing has led to an unprecedented understanding of cancer and accelerated the advent of a new era of clinical oncology with a tone of precision treatment and personalized medicine. However, the gains achieved by a variety of AI models in clinical oncology practice are far from what one would expect, and in particular, there are still many uncertainties in the selection of clinical treatment options that pose significant challenges to the application of AI in clinical oncology. In this review, we summarize emerging approaches, relevant datasets and open-source software of AI and show how to integrate them to address problems from clinical oncology and cancer research. We focus on the principles and procedures for identifying different antitumor strategies with the assistance of AI, including targeted cancer therapy, conventional cancer therapy, and cancer immunotherapy. In addition, we also highlight the current challenges and directions of AI in clinical oncology translation. Overall, we hope this article will provide researchers and clinicians with a deeper understanding of the role and implications of AI in precision cancer therapy, and help AI move more quickly into accepted cancer guidelines.
Affiliation(s)
- Zhe Zhang
- Laboratory of Aging Research and Cancer Drug Target, State Key Laboratory of Biotherapy and Cancer Center, National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, PR China; State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, and Collaborative Innovation Center for Biotherapy, Chengdu 610041, PR China
- Xiawei Wei
- Laboratory of Aging Research and Cancer Drug Target, State Key Laboratory of Biotherapy and Cancer Center, National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, PR China.
|
29
|
Wang K, Yang Y, Wu F, Song B, Wang X, Wang T. Comparative analysis of dimension reduction methods for cytometry by time-of-flight data. Nat Commun 2023; 14:1836. [PMID: 37005472 PMCID: PMC10067013 DOI: 10.1038/s41467-023-37478-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 03/20/2023] [Indexed: 04/04/2023] Open
Abstract
While experimental and informatic techniques around single cell sequencing (scRNA-seq) are advanced, research around mass cytometry (CyTOF) data analysis has severely lagged behind. CyTOF data are notably different from scRNA-seq data in many aspects. This calls for the evaluation and development of computational methods specific for CyTOF data. Dimension reduction (DR) is one of the critical steps of single cell data analysis. Here, we benchmark the performances of 21 DR methods on 110 real and 425 synthetic CyTOF samples. We find that less well-known methods like SAUCIE, SQuaD-MDS, and scvis are the overall best performers. In particular, SAUCIE and scvis are well balanced, SQuaD-MDS excels at structure preservation, whereas UMAP has great downstream analysis performance. We also find that t-SNE (along with the SQuaD-MDS/t-SNE hybrid) possesses the best local structure preservation. Nevertheless, there is a high level of complementarity between these tools, so the choice of method should depend on the underlying data structure and the analytical needs.
Affiliation(s)
- Kaiwen Wang
- Department of Statistical Science, Southern Methodist University, Dallas, TX, 75275, USA
- Yuqiu Yang
- Department of Statistical Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Fangjiang Wu
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Bing Song
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Xinlei Wang
- Department of Statistical Science, Southern Methodist University, Dallas, TX, 75275, USA.
- Department of Mathematics, University of Texas at Arlington, Arlington, TX, 76019, USA.
- Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, 76019, USA.
- Tao Wang
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
|
30
|
Crowell HL, Morillo Leonardo SX, Soneson C, Robinson MD. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol 2023; 24:62. [PMID: 36991470 PMCID: PMC10061781 DOI: 10.1186/s13059-023-02904-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/23/2021] [Accepted: 03/20/2023] [Indexed: 03/31/2023] Open
Abstract
BACKGROUND With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard for results to be credible and transferable to real data. RESULTS Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigate the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. CONCLUSIONS Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Affiliation(s)
- Helena L Crowell
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Charlotte Soneson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Current address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Mark D Robinson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
|
31
|
Zernab Hassan A, Ward HN, Rahman M, Billmann M, Lee Y, Myers CL. Dimensionality reduction methods for extracting functional networks from large-scale CRISPR screens. bioRxiv: the preprint server for biology 2023:2023.02.22.529573. [PMID: 36993440 PMCID: PMC10054965 DOI: 10.1101/2023.02.22.529573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
CRISPR-Cas9 screens facilitate the discovery of gene functional relationships and phenotype-specific dependencies. The Cancer Dependency Map (DepMap) is the largest compendium of whole-genome CRISPR screens aimed at identifying cancer-specific genetic dependencies across human cell lines. A mitochondria-associated bias has been previously reported to mask signals for genes involved in other functions, and thus, methods for normalizing this dominant signal to improve co-essentiality networks are of interest. In this study, we explore three unsupervised dimensionality reduction methods - autoencoders, robust, and classical principal component analyses (PCA) - for normalizing the DepMap to improve functional networks extracted from these data. We propose a novel "onion" normalization technique to combine several normalized data layers into a single network. Benchmarking analyses reveal that robust PCA combined with onion normalization outperforms existing methods for normalizing the DepMap. Our work demonstrates the value of removing low-dimensional signals from the DepMap before constructing functional gene networks and provides generalizable dimensionality reduction-based normalization tools.
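The idea of removing a dominant low-dimensional signal before building a co-essentiality network can be illustrated with a short NumPy sketch. It covers only the classical PCA variant mentioned in the abstract (not robust PCA, autoencoders, or the "onion" combination step), and it is not the authors' code: the function names `remove_top_components` and `coessentiality`, the genes-by-screens matrix layout, and the choice of how many components to remove are all assumptions made for illustration.

```python
import numpy as np


def remove_top_components(scores, n_remove=1):
    """Subtract the leading principal components from a genes x screens matrix.

    Hypothetical helper: columns are centered, the rank-`n_remove` PCA
    reconstruction is computed via SVD, and the residual is returned.
    """
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean(axis=0, keepdims=True)
    u, d, vt = np.linalg.svd(centered, full_matrices=False)
    # Low-rank approximation built from the first n_remove components.
    low_rank = (u[:, :n_remove] * d[:n_remove]) @ vt[:n_remove]
    return centered - low_rank


def coessentiality(normalized):
    """Gene-by-gene Pearson correlation of the normalized profiles."""
    return np.corrcoef(normalized)
```

In this sketch the residual matrix, rather than the raw scores, is passed to `coessentiality`, so correlations driven purely by the removed components no longer dominate the resulting network.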
Affiliation(s)
- Arshia Zernab Hassan
- Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Henry N Ward
- Bioinformatics and Computational Biology Graduate Program, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Mahfuzur Rahman
- Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Maximilian Billmann
- Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Institute of Human Genetics, University of Bonn, School of Medicine and University Hospital Bonn, Bonn, Germany
- Yoonkyu Lee
- Bioinformatics and Computational Biology Graduate Program, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Chad L Myers
- Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Bioinformatics and Computational Biology Graduate Program, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
|
32
|
Choi Y, Li R, Quon G. siVAE: interpretable deep generative models for single-cell transcriptomes. Genome Biol 2023; 24:29. [PMID: 36803416 PMCID: PMC9940350 DOI: 10.1186/s13059-023-02850-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/08/2022] [Accepted: 01/06/2023] [Indexed: 02/22/2023] Open
Abstract
Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.
Affiliation(s)
- Yongin Choi
- Graduate Group in Biomedical Engineering, University of California, Davis, Davis, CA, USA
- Genome Center, University of California, Davis, Davis, CA, USA
- Ruoxin Li
- Genome Center, University of California, Davis, Davis, CA, USA
- Graduate Group in Biostatistics, University of California, Davis, Davis, CA, USA
- Gerald Quon
- Graduate Group in Biomedical Engineering, University of California, Davis, Davis, CA, USA.
- Genome Center, University of California, Davis, Davis, CA, USA.
- Department of Molecular and Cellular Biology, University of California, Davis, Davis, CA, USA.
|
33
|
Hsu LL, Culhane AC. Correspondence analysis for dimension reduction, batch integration, and visualization of single-cell RNA-seq data. Sci Rep 2023; 13:1197. [PMID: 36681709 PMCID: PMC9867729 DOI: 10.1038/s41598-022-26434-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2022] [Accepted: 12/14/2022] [Indexed: 01/22/2023] Open
Abstract
Effective dimension reduction is essential for single cell RNA-seq (scRNAseq) analysis. Principal component analysis (PCA) is widely used, but requires continuous, normally-distributed data; therefore, it is often coupled with log-transformation in scRNAseq applications, which can distort the data and obscure meaningful variation. We describe correspondence analysis (CA), a count-based alternative to PCA. CA is based on decomposition of a chi-squared residual matrix, avoiding distortive log-transformation. To address overdispersion and high sparsity in scRNAseq data, we propose five adaptations of CA, which are fast, scalable, and outperform standard CA and glmPCA, to compute cell embeddings with more performant or comparable clustering accuracy in 8 out of 9 datasets. In particular, we find that CA with Freeman-Tukey residuals performs especially well across diverse datasets. Other advantages of the CA framework include visualization of associations between genes and cell populations in a "CA biplot," and extension to multi-table analysis; we introduce corralm for integrative multi-table dimension reduction of scRNAseq data. We implement CA for scRNAseq data in corral, an R/Bioconductor package which interfaces directly with single cell classes in Bioconductor. Switching from PCA to CA is achieved through a simple pipeline substitution and improves dimension reduction of scRNAseq datasets.
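The decomposition described in this abstract, standard CA as an SVD of a chi-squared (Pearson) residual matrix, can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the corral package's implementation or API: the function name `correspondence_analysis` and its arguments are hypothetical, the input is assumed to be a dense genes-by-cells count matrix with no all-zero rows or columns, and none of the five adaptations (such as Freeman-Tukey residuals) is shown.

```python
import numpy as np


def correspondence_analysis(counts, n_components=2):
    """Classical CA of a non-negative count matrix (rows = genes, columns = cells).

    Hypothetical helper; assumes every row and column has a positive total.
    Returns row and column principal coordinates and the singular values.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()              # correspondence matrix
    r = p.sum(axis=1)                      # row masses
    c = p.sum(axis=0)                      # column masses
    expected = np.outer(r, c)              # independence model
    # Standardized (Pearson) residuals; their squared sum is chi2 / N.
    resid = (p - expected) / np.sqrt(expected)
    u, d, vt = np.linalg.svd(resid, full_matrices=False)
    k = n_components
    row_coords = (u[:, :k] * d[:k]) / np.sqrt(r)[:, None]
    col_coords = (vt[:k].T * d[:k]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, d[:k]
```

Because the residuals are computed directly from counts, no log-transformation is applied at any point; the column coordinates play the role that PCA cell embeddings would play in a typical pipeline.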
Affiliation(s)
- Lauren L Hsu
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
- Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA, USA
- Aedín C Culhane
- Limerick Digital Cancer Research Centre, Health Research Institute, School of Medicine, University of Limerick, Limerick, Ireland.
|
34
|
Drake RS, Villanueva MA, Vilme M, Russo DD, Navia A, Love JC, Shalek AK. Profiling Transcriptional Heterogeneity with Seq-Well S3: A Low-Cost, Portable, High-Fidelity Platform for Massively Parallel Single-Cell RNA-Seq. Methods Mol Biol 2023; 2584:57-104. [PMID: 36495445 DOI: 10.1007/978-1-0716-2756-3_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022]
Abstract
Seq-Well is a high-throughput, picowell-based single-cell RNA-seq technology that can be used to simultaneously profile the transcriptomes of thousands of cells (Gierahn et al. Nat Methods 14(4):395-398, 2017). Relative to its reverse-emulsion-droplet-based counterparts, Seq-Well addresses key cost, portability, and scalability limitations. Recently, we introduced an improved molecular biology for Seq-Well to enhance the information content that can be captured from individual cells using the platform. This update, which we call Seq-Well S3 (S3: Second-Strand Synthesis), incorporates a second-strand-synthesis step after reverse transcription to boost the detection of cellular transcripts normally missed when running the original Seq-Well protocol (Hughes et al. Immunity 53(4):878-894.e7, 2020). This chapter provides details and tips on how to perform Seq-Well S3, along with general pointers on how to subsequently analyze the resultant single-cell RNA-seq data.
Affiliation(s)
- Riley S Drake
- Institute for Medical Engineering and Science (IMES), Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Martin Arreola Villanueva
- Institute for Medical Engineering and Science (IMES), Massachusetts Institute of Technology, Cambridge, MA, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA.
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Mike Vilme
- Institute for Medical Engineering and Science (IMES), Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Daniela D Russo
- Institute for Medical Engineering and Science (IMES), Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Andrew Navia
- Institute for Medical Engineering and Science (IMES), Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- J Christopher Love
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA.
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Alex K Shalek
- Institute for Medical Engineering and Science (IMES), Massachusetts Institute of Technology, Cambridge, MA, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA.
- The Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
|
35
|
Chatterjee D, Deng WM. Standardization of Single-Cell RNA-Sequencing Analysis Workflow to Study Drosophila Ovary. Methods Mol Biol 2023; 2677:151-171. [PMID: 37464241 DOI: 10.1007/978-1-0716-3259-8_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2023]
Abstract
Developments in single-cell technology have considerably changed the way we study biology. Significant efforts have been made over the last few years to build comprehensive cell-type-specific transcriptomic atlases for a wide range of tissues in several model organisms in order to discover cell-type-specific markers and drivers of gene expression. One such tissue is the ovary of the fruit fly Drosophila melanogaster, which is a popular model system with wide-ranging applications in the study of both development and disease. Three independent studies have recently produced comprehensive maps of cell-type-specific gene expression that describe both spatiotemporal regulation of the process of oogenesis and unique transcriptomic profiles of different cell types that constitute the ovary. In this chapter, we outline the wet-lab protocol that was followed in our recent study for sample preparation and reanalyze the resultant dataset to discuss the benchmarks in data analysis, which are fundamental to comprehensive curation of the single-cell dataset representing the fly ovary.
Affiliation(s)
- Deeptiman Chatterjee
- Department of Biochemistry and Molecular Biology, Tulane University School of Medicine, Tulane Cancer Center, New Orleans, LA, USA.
- Current address: Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
- Wu-Min Deng
- Department of Biochemistry and Molecular Biology, Tulane University School of Medicine, Tulane Cancer Center, New Orleans, LA, USA.
| |
Collapse
|
36
|
Wu S, Schmitz U. Single-cell and long-read sequencing to enhance modelling of splicing and cell-fate determination. Comput Struct Biotechnol J 2023; 21:2373-2380. [PMID: 37066125 PMCID: PMC10091034 DOI: 10.1016/j.csbj.2023.03.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 03/13/2023] [Accepted: 03/13/2023] [Indexed: 04/03/2023] Open
Abstract
Single-cell sequencing technologies have revolutionised the life sciences and biomedical research. Single-cell sequencing provides high-resolution data on cell heterogeneity, allowing high-fidelity cell type identification, and lineage tracking. Computational algorithms and mathematical models have been developed to make sense of the data, compensate for errors and simulate the biological processes, which has led to breakthroughs in our understanding of cell differentiation, cell-fate determination and tissue cell composition. The development of long-read (a.k.a. third-generation) sequencing technologies has produced powerful tools for investigating alternative splicing, isoform expression (at the RNA level), genome assembly and the detection of complex structural variants (at the DNA level). In this review, we provide an overview of the recent advancements in single-cell and long-read sequencing technologies, with a particular focus on the computational algorithms that help in correcting, analysing, and interpreting the resulting data. Additionally, we review some mathematical models that use single-cell and long-read sequencing data to study cell-fate determination and alternative splicing, respectively. Moreover, we highlight the emerging opportunities in modelling cell-fate determination that result from the combination of single-cell and long-read sequencing technologies.
|
37
|
Quah FX, Hemberg M. SC3s: efficient scaling of single cell consensus clustering to millions of cells. BMC Bioinformatics 2022; 23:536. [PMID: 36503522 PMCID: PMC9743492 DOI: 10.1186/s12859-022-05085-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/03/2022] [Accepted: 11/25/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Today it is possible to profile the transcriptome of individual cells, and a key step in the analysis of these datasets is unsupervised clustering. For very large datasets, efficient algorithms are required to ensure that analyses can be conducted with reasonable time and memory requirements. RESULTS Here, we present a highly efficient k-means based approach, and we demonstrate that it scales favorably with the number of cells with regard to time and memory. CONCLUSIONS We have demonstrated that our streaming k-means clustering algorithm gives state-of-the-art performance while resource requirements scale favorably for up to 2 million cells.
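The abstract above refers to a streaming k-means algorithm without giving its details. As a rough illustration of the general idea only, the sketch below processes the data in batches and keeps a running mean per cluster, so that memory use depends on the batch size rather than on the total number of cells. The function name `streaming_kmeans`, the caller-supplied initial centers, and the running-mean update rule are assumptions made for this example; they are not taken from the SC3s implementation.

```python
import numpy as np


def streaming_kmeans(batches, init_centers):
    """Illustrative batch-wise k-means (hypothetical helper, not the SC3s code).

    batches      : iterable of (n_i, d) arrays, visited once each
    init_centers : (k, d) array-like of starting centroids (assumed given)
    """
    centers = np.array(init_centers, dtype=float)
    counts = np.zeros(len(centers), dtype=int)

    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        # Squared Euclidean distance from every point to every centroid: (n_i, k)
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        for j in range(len(centers)):
            members = batch[labels == j]
            if len(members) == 0:
                continue
            counts[j] += len(members)
            # Running-mean update: the centroid becomes the mean of all
            # points assigned to it so far, without storing those points.
            centers[j] += (members.sum(axis=0) - len(members) * centers[j]) / counts[j]

    return centers, counts
```

Only one batch and the `k` centroids are held in memory at any time, which is the property that makes this family of methods attractive for very large datasets.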
Affiliation(s)
- Fu Xiang Quah
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK; The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Martin Hemberg
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK; Present address: Evergrande Center for Immunologic Diseases, Harvard Medical School and Brigham and Women’s Hospital, 75 Francis Street, Boston, MA 02115, USA
|
38
|
Spatial-ID: a cell typing method for spatially resolved transcriptomics via transfer learning and spatial embedding. Nat Commun 2022; 13:7640. [PMID: 36496406 PMCID: PMC9741613 DOI: 10.1038/s41467-022-35288-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/17/2022] [Accepted: 11/25/2022] [Indexed: 12/13/2022] Open
Abstract
Spatially resolved transcriptomics provides the opportunity to investigate the gene expression profiles and the spatial context of cells in their naive state, but at low transcript detection sensitivity or with limited gene throughput. Comprehensive annotation of cell types in spatially resolved transcriptomics to understand biological processes at the single-cell level remains challenging. Here we propose Spatial-ID, a supervision-based cell typing method that combines the existing knowledge of reference single-cell RNA-seq data and the spatial information of spatially resolved transcriptomics data. We present a series of benchmarking analyses on publicly available spatially resolved transcriptomics datasets that demonstrate the superiority of Spatial-ID compared with state-of-the-art methods. In addition, we apply Spatial-ID to a self-collected mouse brain hemisphere dataset measured by Stereo-seq, which shows the scalability of Spatial-ID to three-dimensional large field tissues with subcellular spatial resolution.
|
39
|
Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] Open
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Affiliation(s)
- Min Su
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
- Tao Pan
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
- Qiu-Zhen Chen
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
- Wei-Wei Zhou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China
- Yi Gong
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China; Department of Immunology, Nanjing Medical University, Nanjing, 211166, China
- Gang Xu
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
- Huan-Yu Yan
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
- Si Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
- Qiao-Zhen Shi
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
- Ya Zhang
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
- Xiao He
- Department of Laboratory Medicine, Women and Children's Hospital of Chongqing Medical University, Chongqing, 401174, China
- Shi-Cai Fan
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110, Guangdong, China
- Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China
- Murray J Cairns
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW, 2308, Australia; Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW, 2305, Australia
- Xi Wang
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
- Yong-Sheng Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
|
40
|
Spatially aware dimension reduction for spatial transcriptomics. Nat Commun 2022; 13:7203. [PMID: 36418351 PMCID: PMC9684472 DOI: 10.1038/s41467-022-34879-1] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 03/10/2022] [Accepted: 11/10/2022] [Indexed: 11/27/2022] Open
Abstract
Spatial transcriptomics are a collection of genomic technologies that have enabled transcriptomic profiling on tissues with spatial localization information. Analyzing spatial transcriptomic data is computationally challenging, as the data collected from various spatial transcriptomic technologies are often noisy and display substantial spatial correlation across tissue locations. Here, we develop a spatially-aware dimension reduction method, SpatialPCA, that can extract a low dimensional representation of the spatial transcriptomics data with biological signal and preserved spatial correlation structure, thus unlocking many existing computational tools previously developed in single-cell RNAseq studies for tailored analysis of spatial transcriptomics. We illustrate the benefits of SpatialPCA for spatial domain detection and explore its utility for trajectory inference on the tissue and for high-resolution spatial map construction. In the real data applications, SpatialPCA identifies key molecular and immunological signatures in a detected tumor surrounding microenvironment, including a tertiary lymphoid structure that shapes the gradual transcriptomic transition during tumorigenesis and metastasis. In addition, SpatialPCA detects the past neuronal developmental history that underlies the current transcriptomic landscape across tissue locations in the cortex.
|
41
|
Lee H, Han B. FastRNA: An efficient solution for PCA of single-cell RNA-sequencing data based on a batch-accounting count model. Am J Hum Genet 2022; 109:1974-1985. [PMID: 36206757 PMCID: PMC9674949 DOI: 10.1016/j.ajhg.2022.09.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 09/14/2022] [Indexed: 01/26/2023] Open
Abstract
Almost always, the analysis of single-cell RNA-sequencing (scRNA-seq) data begins with the generation of the low dimensional embedding of the data by principal-component analysis (PCA). Because scRNA-seq data are count data, log transformation is routinely applied to correct skewness prior to PCA, which is often argued to add bias to the data. Alternatively, studies have proposed methods that directly assume a count model and use approximately normally distributed count residuals for PCA. Despite their theoretical advantage of directly modeling count data, these methods are extremely slow for large datasets. In fact, when the data size grows, even the standard log normalization becomes inefficient. Here, we present FastRNA, a highly efficient solution for PCA of scRNA-seq data based on a count model accounting for both batches and cell size factors. Although we assume the same general count model as previous methods, our method uses two orders of magnitude less time and memory than the other count-based methods and an order of magnitude less time and memory than the standard log normalization. This achievement results from our unique algebraic optimization that completely avoids the formation of the large dense residual matrix in memory. In addition, our method has the benefit that batch effects are eliminated from the data prior to PCA. Generating a batch-accounted PC of an atlas-scale dataset with 2 million cells takes less than a minute and 1 GB of memory with our method.
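The abstract above contrasts FastRNA with the standard route of log normalization followed by PCA. For orientation, here is a minimal NumPy sketch of that baseline only, not of FastRNA itself. The function name `log_normalize_pca`, the scaling constant of 10,000, and the use of a dense SVD are assumptions made for this illustration.

```python
import numpy as np


def log_normalize_pca(counts, n_components=2, scale=1e4):
    """Baseline sketch: size-factor normalization, log1p, then PCA via SVD.

    counts : (cells, genes) array of non-negative counts
    Returns the (cells, n_components) embedding and the
    (n_components, genes) principal axes.
    """
    counts = np.asarray(counts, dtype=float)

    # Per-cell library size; guard against empty cells to avoid division by zero.
    size = counts.sum(axis=1, keepdims=True)
    size[size == 0] = 1.0

    # Log transformation applied to correct skewness before PCA.
    logged = np.log1p(counts / size * scale)

    # Centering each gene produces a dense cells-by-genes matrix.
    centered = logged - logged.mean(axis=0)

    # PCA through the thin SVD of the centered matrix.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    embedding = u[:, :n_components] * s[:n_components]
    return embedding, vt[:n_components]
```

Note that `centered` is a dense cells-by-genes matrix; this is the kind of large intermediate that, according to the abstract, FastRNA avoids forming in memory.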
Affiliation(s)
- Hanbin Lee
- Department of Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea; corresponding author
- Buhm Han
- Department of Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea; Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea; Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, Republic of Korea; Genealogy Inc., Seoul, Republic of Korea; corresponding author
|
42
|
Predicting the prevalence of lung cancer using feature transformation techniques. Egyptian Informatics Journal 2022. [DOI: 10.1016/j.eij.2022.08.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
43
|
Li Z, Zhou X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol 2022; 23:168. [PMID: 35927760 PMCID: PMC9351148 DOI: 10.1186/s13059-022-02734-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 01/31/2022] [Accepted: 07/21/2022] [Indexed: 02/08/2023] Open
Abstract
Spatial transcriptomic studies are reaching single-cell spatial resolution, with data often collected from multiple tissue sections. Here, we present a computational method, BASS, that enables multi-scale and multi-sample analysis for single-cell resolution spatial transcriptomics. BASS performs cell type clustering at the single-cell scale and spatial domain detection at the tissue regional scale, with the two tasks carried out simultaneously within a Bayesian hierarchical modeling framework. We illustrate the benefits of BASS through comprehensive simulations and applications to three datasets. The substantial power gain brought by BASS allows us to reveal accurate transcriptomic and cellular landscape in both cortex and hypothalamus.
Affiliation(s)
- Zheng Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
- Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
|
44
|
Unified K-means coupled self-representation and neighborhood kernel learning for clustering single-cell RNA-sequencing data. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
|
45
|
Wang Y, Xu Y, Zang Z, Wu L, Li Z. Panoramic Manifold Projection (Panoramap) for Single-Cell Data Dimensionality Reduction and Visualization. Int J Mol Sci 2022; 23:7775. [PMID: 35887125 PMCID: PMC9316349 DOI: 10.3390/ijms23147775] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/06/2022] [Revised: 07/03/2022] [Accepted: 07/12/2022] [Indexed: 12/22/2022] Open
Abstract
Nonlinear dimensionality reduction (NLDR) methods such as t-Distributed Stochastic Neighbour Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) have been widely used for biological data exploration, especially in single-cell analysis. However, the existing methods have drawbacks in preserving data's geometric and topological structures. A high-dimensional data analysis method, called Panoramic manifold projection (Panoramap), was developed as an enhanced deep learning framework for structure-preserving NLDR. Panoramap enhances deep neural networks by using cross-layer geometry-preserving constraints. The constraints constitute the loss for deep manifold learning and serve as geometric regularizers for NLDR network training. Therefore, Panoramap has better performance in preserving global structures of the original data. Here, we apply Panoramap to single-cell datasets and show that Panoramap excels at delineating the cell type lineage/hierarchy and can reveal rare cell types. Panoramap can facilitate trajectory inference and has the potential to aid in the early diagnosis of tumors. Panoramap gives improved and more biologically plausible visualization and interpretation of single-cell data. Panoramap can be readily used in single-cell research domains and other research fields that involve high dimensional data analysis.
Affiliation(s)
- Yajuan Wang
- College of Mathematical Medicine, Zhejiang Normal University, Jinhua 321004, China
- School of Engineering, Westlake University, Hangzhou 310024, China
- Yongjie Xu
- School of Engineering, Westlake University, Hangzhou 310024, China
- Zelin Zang
- School of Engineering, Westlake University, Hangzhou 310024, China
- Lirong Wu
- School of Engineering, Westlake University, Hangzhou 310024, China
- Ziqing Li
- School of Engineering, Westlake University, Hangzhou 310024, China
|
46
|
Bard JE, Nowak NJ, Buck MJ, Sinha S. Multimodal Dimension Reduction and Subtype Classification of Head and Neck Squamous Cell Tumors. Front Oncol 2022; 12:892207. [PMID: 35912202 PMCID: PMC9326399 DOI: 10.3389/fonc.2022.892207] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/08/2022] [Accepted: 06/09/2022] [Indexed: 01/18/2023] Open
Abstract
Traditional analysis of genomic data from bulk sequencing experiments seeks to group and compare sample cohorts into biologically meaningful groups. To accomplish this task, large scale databases of patient-derived samples, like that of TCGA, have been established, giving the ability to interrogate multiple data modalities per tumor. We have developed a computational strategy employing multimodal integration paired with spectral clustering and modern dimension reduction techniques such as PHATE to provide a more robust method for cancer sub-type classification. Using this integrated approach, we have examined 514 Head and Neck Squamous Carcinoma (HNSC) tumor samples from TCGA across gene-expression, DNA-methylation, and microbiome data modalities. We show that these approaches, primarily developed for single-cell sequencing, can be efficiently applied to bulk tumor sequencing data. Our multimodal analysis captures the dynamic heterogeneity, identifies new subtypes of HNSC and refines known ones, and orders tumor samples along well-defined cellular trajectories. Collectively, these results showcase the inherent molecular complexity of tumors and offer insights into carcinogenesis and the importance of targeted therapy. Computational techniques such as those highlighted in our study provide an organic and powerful approach to identify granular patterns in large and noisy datasets that may otherwise be overlooked.
Affiliation(s)
- Jonathan E. Bard
- Department of Biochemistry, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States; Genomics and Bioinformatics Core, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States
- Norma J. Nowak
- Department of Biochemistry, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States; Genomics and Bioinformatics Core, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States
- Michael J. Buck
- Department of Biochemistry, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States; Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States; corresponding author
- Satrajit Sinha
- Department of Biochemistry, Jacobs School of Medicine and Biomedical Sciences, State University of New York at Buffalo, Buffalo, NY, United States; corresponding author
|
47
|
Ellis D, Wu D, Datta S. SAREV: A review on statistical analytics of single-cell RNA sequencing data. Wiley Interdisciplinary Reviews. Computational Statistics 2022; 14:e1558. [PMID: 36034329 PMCID: PMC9400796 DOI: 10.1002/wics.1558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2020] [Accepted: 04/09/2021] [Indexed: 06/15/2023]
Abstract
Owing to the development of next-generation RNA sequencing (NGS) technologies, there has been tremendous progress in determining the role of genomics, transcriptomics and epigenomics in complex biological systems. However, scientists have realized that data obtained using earlier technology, frequently called 'bulk RNA-seq' data, provide information averaged across all the cells present in a tissue. The more recently developed single-cell (scRNA-seq) technology provides transcriptomic information at single-cell resolution. Nevertheless, these high-resolution data have their own complexities and demand novel statistical data analysis methods to provide effective and highly accurate results on complex biological systems. In this review, we cover many such recently developed statistical methods, both for researchers wanting to pursue scRNA-seq statistical and computational research and for scientists looking for existing methods and free software tools to apply to the data they generate. This review is certainly not exhaustive due to page limitations. We have tried to cover the popular methods starting from quality control to the downstream analysis of finding differentially expressed genes and concluding with a brief description of network analysis.
Affiliation(s)
- Dorothy Ellis
- Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
- Dongyuan Wu
- Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
- Susmita Datta
- Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
|
48
|
Context-aware deconvolution of cell-cell communication with Tensor-cell2cell. Nat Commun 2022; 13:3665. [PMID: 35760817 PMCID: PMC9237099 DOI: 10.1038/s41467-022-31369-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 09/29/2021] [Accepted: 06/14/2022] [Indexed: 12/23/2022] Open
Abstract
Cell interactions determine phenotypes, and intercellular communication is shaped by cellular contexts such as disease state, organismal life stage, and tissue microenvironment. Single-cell technologies measure the molecules mediating cell–cell communication, and emerging computational tools can exploit these data to decipher intercellular communication. However, current methods either disregard cellular context or rely on simple pairwise comparisons between samples, thus limiting the ability to decipher complex cell–cell communication across multiple time points, levels of disease severity, or spatial contexts. Here we present Tensor-cell2cell, an unsupervised method using tensor decomposition, which deciphers context-driven intercellular communication by simultaneously accounting for multiple stages, states, or locations of the cells. To do so, Tensor-cell2cell uncovers context-driven patterns of communication associated with different phenotypic states and determined by unique combinations of cell types and ligand-receptor pairs. As such, Tensor-cell2cell robustly improves upon and extends the analytical capabilities of existing tools. We show Tensor-cell2cell can identify multiple modules associated with distinct communication processes (e.g., participating cell–cell and ligand-receptor pairs) linked to severities of Coronavirus Disease 2019 and to Autism Spectrum Disorder. Thus, we introduce an effective and easy-to-use strategy for understanding complex communication patterns across diverse conditions. Cellular contexts such as disease state, organismal life stage and tissue microenvironment, shape intercellular communication, and ultimately affect an organism’s phenotypes. Here, the authors present Tensor-cell2cell, an unsupervised method for deciphering context-driven intercellular communication.
|
49
|
Zandavi SM, Koch FC, Vijayan A, Zanini F, Mora FV, Ortega DG, Vafaee F. Disentangling single-cell omics representation with a power spectral density-based feature extraction. Nucleic Acids Res 2022; 50:5482-5492. [PMID: 35639509 PMCID: PMC9178020 DOI: 10.1093/nar/gkac436] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/21/2021] [Revised: 04/26/2022] [Accepted: 05/10/2022] [Indexed: 12/13/2022] Open
Abstract
Emerging single-cell technologies provide high-resolution measurements of distinct cellular modalities opening new avenues for generating detailed cellular atlases of many and diverse tissues. The high dimensionality, sparsity, and inaccuracy of single cell sequencing measurements, however, can obscure discriminatory information, mask cellular subtype variations and complicate downstream analyses which can limit our understanding of cell function and tissue heterogeneity. Here, we present a novel pre-processing method (scPSD) inspired by power spectral density analysis that enhances the accuracy for cell subtype separation from large-scale single-cell omics data. We comprehensively benchmarked our method on a wide range of single-cell RNA-sequencing datasets and showed that scPSD pre-processing, while being fast and scalable, significantly reduces data complexity, enhances cell-type separation, and enables rare cell identification. Additionally, we applied scPSD to transcriptomics and chromatin accessibility cell atlases and demonstrated its capacity to discriminate over 100 cell types across the whole organism and across different modalities of single-cell omics data.
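The abstract above names power spectral density analysis as the inspiration for scPSD but does not spell out the transform. As a rough illustration of the underlying concept only, the sketch below computes a plain periodogram of each cell's expression vector with NumPy. The function name `periodogram_features` and the per-cell mean-centering step are assumptions for this example and are not taken from the scPSD implementation, which may differ substantially.

```python
import numpy as np


def periodogram_features(expr):
    """Per-cell periodogram (hypothetical helper, not the scPSD pipeline).

    expr : (cells, genes) array; each row is treated as a real-valued signal
    Returns a (cells, genes // 2 + 1) array of non-negative power values.
    """
    expr = np.asarray(expr, dtype=float)
    n_genes = expr.shape[1]

    # Remove each cell's mean so the zero-frequency bin carries no power.
    centered = expr - expr.mean(axis=1, keepdims=True)

    # One-sided discrete Fourier transform along the gene axis.
    spectrum = np.fft.rfft(centered, axis=1)

    # Periodogram estimate: squared magnitude scaled by signal length.
    return (np.abs(spectrum) ** 2) / n_genes
```

The output could then be passed to any downstream clustering or classification step in place of the raw expression values; whether that helps on a given dataset is an empirical question this sketch does not address.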
Affiliation(s)
- Seid Miad Zandavi
- School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia; Programs in Metabolism and Medical & Population Genetics, Broad Institute, Cambridge, MA, USA; Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Forrest C Koch
- School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
- Abhishek Vijayan
- School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
- Fabio Zanini
- Prince of Wales Clinical School, UNSW Sydney, Australia; Cellular Genomics Future Institute, UNSW Sydney, Australia
- Fatima Valdes Mora
- Children's Cancer Institute, Lowy Cancer Research Centre, UNSW Sydney, Australia; School of Women's and Children's Health, Faculty of Medicine, UNSW, Sydney, Australia
- David Gallego Ortega
- School of Biomedical Engineering, University of Technology Sydney (UTS), Australia
- Fatemeh Vafaee
- School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia; Cellular Genomics Future Institute, UNSW Sydney, Australia; UNSW Data Science Hub (uDASH), UNSW Sydney, Australia
|
50
|
Wang Y, Peng Q, Mou X, Wang X, Li H, Han T, Sun Z, Wang X. A successful hybrid deep learning model aiming at promoter identification. BMC Bioinformatics 2022; 23:206. [PMID: 35641900 PMCID: PMC9158169 DOI: 10.1186/s12859-022-04735-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Accepted: 05/16/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes. RESULTS The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets. 
CONCLUSIONS The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.
Affiliation(s)
- Ying Wang
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Qinke Peng
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Xu Mou
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Xinyuan Wang
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Haozhou Li
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Tian Han
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Zhao Sun
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
- Xiao Wang
- Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
|