1
|
Wang Y, Chen X, Zheng Z, Huang L, Xie W, Wang F, Zhang Z, Wong KC. scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics. iScience 2024; 27:109352. [PMID: 38510148 PMCID: PMC10951644 DOI: 10.1016/j.isci.2024.109352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2023] [Revised: 12/29/2023] [Accepted: 02/23/2024] [Indexed: 03/22/2024] Open
Abstract
Gene regulatory networks (GRNs) involve complex and multi-layer regulatory interactions between regulators and their target genes. Precise knowledge of GRNs is important in understanding cellular processes and molecular functions. Recent breakthroughs in single-cell sequencing technology made it possible to infer GRNs at single-cell level. Existing methods, however, are limited by expensive computations, and sometimes simplistic assumptions. To overcome these obstacles, we propose scGREAT, a framework to infer GRN using gene embeddings and transformer from single-cell transcriptomics. scGREAT starts by constructing gene expression and gene biotext dictionaries from scRNA-seq data and gene text information. The representation of TF gene pairs is learned through optimizing embedding space by transformer-based engine. Results illustrated scGREAT outperformed other contemporary methods on benchmarks. Besides, gene representations from scGREAT provide valuable gene regulation insights, and external validation on spatial transcriptomics illuminated the mechanism behind scGREAT annotation. Moreover, scGREAT identified several TF target regulations corroborated in studies.
Collapse
Affiliation(s)
- Yuchen Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Cutaneous Biology Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Zetian Zheng
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Lei Huang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Weidun Xie
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Zhaolei Zhang
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China
| |
Collapse
|
2
|
Tsimenidis S, Vrochidou E, Papakostas GA. Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. Int J Mol Sci 2022; 23:12272. [PMID: 36293133 PMCID: PMC9603455 DOI: 10.3390/ijms232012272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2022] [Revised: 10/03/2022] [Accepted: 10/12/2022] [Indexed: 11/25/2022] Open
Abstract
Medical discoveries mainly depend on the capability to process and analyze biological datasets, which inundate the scientific community and are still expanding as the cost of next-generation sequencing technologies is decreasing. Deep learning (DL) is a viable method to exploit this massive data stream since it has advanced quickly with there being successive innovations. However, an obstacle to scientific progress emerges: the difficulty of applying DL to biology, and this because both fields are evolving at a breakneck pace, thus making it hard for an individual to occupy the front lines of both of them. This paper aims to bridge the gap and help computer scientists bring their valuable expertise into the life sciences. This work provides an overview of the most common types of biological data and data representations that are used to train DL models, with additional information on the models themselves and the various tasks that are being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. Alternatively, this study could be also useful to researchers in biology to understand and utilize the power of DL to gain better insights into and extract important information from the omics data.
Collapse
Affiliation(s)
| | | | - George A. Papakostas
- MLV Research Group, Department of Computer Science, International Hellenic University, 65404 Kavala, Greece
| |
Collapse
|
3
|
Yu XT, Chen M, Guo J, Zhang J, Zeng T. Noninvasive detection and interpretation of gastrointestinal diseases by collaborative serum metabolite and magnetically controlled capsule endoscopy. Comput Struct Biotechnol J 2022; 20:5524-5534. [PMID: 36249561 PMCID: PMC9550535 DOI: 10.1016/j.csbj.2022.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2022] [Revised: 09/15/2022] [Accepted: 10/02/2022] [Indexed: 11/16/2022] Open
Abstract
Gastrointestinal diseases are complex diseases that occur in the gastrointestinal tract. Common gastrointestinal diseases include chronic gastritis, peptic ulcers, inflammatory bowel disease, and gastrointestinal tumors. These diseases may manifest a long course, difficult treatment, and repeated attacks. Gastroscopy and mucosal biopsy are the gold standard methods for diagnosing gastric and duodenal diseases, but they are invasive procedures and carry risks due to the necessity of sedation and anesthesia. Recently, several new approaches have been developed, including serological examination and magnetically controlled capsule endoscopy (MGCE). However, serological markers lack lesion information, while MGCE images lack molecular information. This study proposes combining these two technologies in a collaborative noninvasive diagnostic scheme as an alternative to the standard procedures. We introduce an interpretable framework for the clinical diagnosis of gastrointestinal diseases. Based on collected blood samples and MGCE records of patients with gastrointestinal diseases and comparisons with normal individuals, we selected serum metabolite signatures by bioinformatic analysis, captured image embedding signatures by convolutional neural networks, and inferred the location-specific associations between these signatures. Our study successfully identified five key metabolite signatures with functional relevance to gastrointestinal disease. The combined signatures achieved discrimination AUC of 0.88. Meanwhile, the image embedding signatures showed different levels of validation and testing accuracy ranging from 0.7 to 0.9 according to different locations in the gastrointestinal tract as explained by their specific associations with metabolite signatures. Overall, our work provides a new collaborative noninvasive identification pipeline and candidate metabolite biomarkers for image auxiliary diagnosis. This method should be valuable for the noninvasive detection and interpretation of gastrointestinal and other complex diseases.
Collapse
Affiliation(s)
- Xiang-Tian Yu
- Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai, China,Corresponding authors at: Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Yishan Road 600, Shanghai, China (X.-T. Yu); Guangzhou Laboratory, Guangzhou, China (T. Zeng).
| | - Ming Chen
- Department of Gastroenterology, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China
| | - Jingyi Guo
- Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai, China
| | - Jing Zhang
- Department of Gastroenterology, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China
| | - Tao Zeng
- Guangzhou Laboratory, Guangzhou, China,Corresponding authors at: Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Yishan Road 600, Shanghai, China (X.-T. Yu); Guangzhou Laboratory, Guangzhou, China (T. Zeng).
| |
Collapse
|
4
|
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Collapse
Affiliation(s)
- Matthew Brendel
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA.
| | - Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Hao Zhang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Olivier Elemento
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA.
| |
Collapse
|
5
|
Zhang C, Chen Y, Zeng T, Zhang C, Chen L. Deep latent space fusion for adaptive representation of heterogeneous multi-omics data. Brief Bioinform 2022; 23:6515231. [PMID: 35079777 DOI: 10.1093/bib/bbab600] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2021] [Revised: 12/23/2021] [Accepted: 12/26/2021] [Indexed: 01/01/2023] Open
Abstract
The integration of multi-omics data makes it possible to understand complex biological organisms at the system level. Numerous integration approaches have been developed by assuming a common underlying data space. Due to the noise and heterogeneity of biological data, the performance of these approaches is greatly affected. In this work, we propose a novel deep neural network architecture, named Deep Latent Space Fusion (DLSF), which integrates the multi-omics data by learning consistent manifold in the sample latent space for disease subtypes identification. DLSF is built upon a cycle autoencoder with a shared self-expressive layer, which can naturally and adaptively merge nonlinear features at each omics level into one unified sample manifold and produce adaptive representation of heterogeneous samples at the multi-omics level. We have assessed DLSF on various biological and biomedical datasets to validate its effectiveness. DLSF can efficiently and accurately capture the intrinsic manifold of the sample structures or sample clusters compared with other state-of-the-art methods, and DLSF yielded more significant outcomes for biological significance, survival prognosis and clinical relevance in application of cancer study in The Cancer Genome Atlas. Notably, as a deep case study, we determined a new molecular subtype of kidney renal clear cell carcinoma that may benefit immunotherapy in the viewpoint of multi-omics, and we further found potential subtype-specific biomarkers from multiple omics data, which were validated by independent datasets. In addition, we applied DLSF to identify potential therapeutic agents of different molecular subtypes of chronic lymphocytic leukemia, demonstrating the scalability of DLSF in diverse omics data types and application scenarios.
Collapse
Affiliation(s)
- Chengming Zhang
- State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
| | - Yabin Chen
- State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Tao Zeng
- State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Guangzhou Laboratory, Guangzhou, China
| | - Chuanchao Zhang
- Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong 519031, China
| | - Luonan Chen
- State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
- Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong 519031, China
| |
Collapse
|
6
|
Interpretable Autoencoders Trained on Single Cell Sequencing Data Can Transfer Directly to Data from Unseen Tissues. Cells 2021; 11:cells11010085. [PMID: 35011647 PMCID: PMC8750521 DOI: 10.3390/cells11010085] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2021] [Revised: 12/17/2021] [Accepted: 12/21/2021] [Indexed: 01/04/2023] Open
Abstract
Autoencoders have been used to model single-cell mRNA-sequencing data with the purpose of denoising, visualization, data simulation, and dimensionality reduction. We, and others, have shown that autoencoders can be explainable models and interpreted in terms of biology. Here, we show that such autoencoders can generalize to the extent that they can transfer directly without additional training. In practice, we can extract biological modules, denoise, and classify data correctly from an autoencoder that was trained on a different dataset and with different cells (a foreign model). We deconvoluted the biological signal encoded in the bottleneck layer of scRNA-models using saliency maps and mapped salient features to biological pathways. Biological concepts could be associated with specific nodes and interpreted in relation to biological pathways. Even in this unsupervised framework, with no prior information about cell types or labels, the specific biological pathways deduced from the model were in line with findings in previous research. It was hypothesized that autoencoders could learn and represent meaningful biology; here, we show with a systematic experiment that this is true and even transcends the training data. This means that carefully trained autoencoders can be used to assist the interpretation of new unseen data.
Collapse
|
7
|
Yin Q, Wang Y, Guan J, Ji G. scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data. Brief Bioinform 2021; 23:6463428. [PMID: 34913057 DOI: 10.1093/bib/bbab508] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2021] [Revised: 10/28/2021] [Accepted: 11/04/2021] [Indexed: 12/12/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) allows quantitative analysis of gene expression at the level of single cells, beneficial to study cell heterogeneity. The recognition of cell types facilitates the construction of cell atlas in complex tissues or organisms, which is the basis of almost all downstream scRNA-seq data analyses. Using disease-related scRNA-seq data to perform the prediction of disease status can facilitate the specific diagnosis and personalized treatment of disease. Since single-cell gene expression data are high-dimensional and sparse with dropouts, we propose scIAE, an integrative autoencoder-based ensemble classification framework, to firstly perform multiple random projections and apply integrative and devisable autoencoders (integrating stacked, denoising and sparse autoencoders) to obtain compressed representations. Then base classifiers are built on the lower-dimensional representations and the predictions from all base models are integrated. The comparison of scIAE and common feature extraction methods shows that scIAE is effective and robust, independent of the choice of dimension, which is beneficial to subsequent cell classification. By testing scIAE on different types of data and comparing it with existing general and single-cell-specific classification methods, it is proven that scIAE has a great classification power in cell type annotation intradataset, across batches, across platforms and across species, and also disease status prediction. The architecture of scIAE is flexible and devisable, and it is available at https://github.com/JGuan-lab/scIAE.
Collapse
Affiliation(s)
- Qingyang Yin
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China.,Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
| | - Yang Wang
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China
| | - Jinting Guan
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China.,National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China
| | - Guoli Ji
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China.,National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China
| |
Collapse
|
8
|
Zhang Z, Pan Q, Ge H, Xing L, Hong Y, Chen P. Deep learning-based clustering robustly identified two classes of sepsis with both prognostic and predictive values. EBioMedicine 2020; 62:103081. [PMID: 33181462 PMCID: PMC7658497 DOI: 10.1016/j.ebiom.2020.103081] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2020] [Revised: 09/19/2020] [Accepted: 10/07/2020] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Sepsis is a heterogenous syndrome and individualized management strategy is the key to successful treatment. Genome wide expression profiling has been utilized for identifying subclasses of sepsis, but the clinical utility of these subclasses was limited because of the classification instability, and the lack of a robust class prediction model with extensive external validation. The study aimed to develop a parsimonious class model for the prediction of class membership and validate the model for its prognostic and predictive capability in external datasets. METHODS The Gene Expression Omnibus (GEO) and ArrayExpress databases were searched from inception to April 2020. Datasets containing whole blood gene expression profiling in adult sepsis patients were included. Autoencoder was used to extract representative features for k-means clustering. Genetic algorithms (GA) were employed to derive a parsimonious 5-gene class prediction model. The class model was then applied to external datasets (n = 780) to evaluate its prognostic and predictive performance. FINDINGS A total of 12 datasets involving 1613 patients were included. Two classes were identified in the discovery cohort (n = 685). Class 1 was characterized by immunosuppression with higher mortality than class 2 (21.8% [70/321] vs. 12.1% [44/364]; p < 0.01 for Chi-square test). A 5-gene class model (C14orf159, AKNA, PILRA, STOM and USP4) was developed with GA. In external validation cohorts, the 5-gene class model (AUC: 0.707; 95% CI: 0.664 - 0.750) performed better in predicting mortality than sepsis response signature (SRS) endotypes (AUC: 0.610; 95% CI: 0.521 - 0.700), and performed equivalently to the APACHE II score (AUC: 0.681; 95% CI: 0.595 - 0.767). In the dataset E-MTAB-7581, the use of hydrocortisone was associated with increased risk of mortality (OR: 3.15 [1.13, 8.82]; p = 0.029) in class 2. The effect was not statistically significant in class 1 (OR: 1.88 [0.70, 5.09]; p = 0.211). INTERPRETATION Our study identified two classes of sepsis that showed different mortality rates and responses to hydrocortisone therapy. Class 1 was characterized by immunosuppression with higher mortality rate than class 2. We further developed a 5-gene class model to predict class membership. FUNDING The study was funded by the National Natural Science Foundation of China (Grant No. 81,901,929).
Collapse
Affiliation(s)
- Zhongheng Zhang
- Department of Emergency Medicine, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, 310016, China.
| | - Qing Pan
- College of Information Engineering, Zhejiang University of Technology, 310023, Hangzhou, China.
| | - Huiqing Ge
- Department of Respiratory Care, Sir Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China.
| | - Lifeng Xing
- Department of Emergency Medicine, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, 310016, China.
| | - Yucai Hong
- Department of Emergency Medicine, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, 310016, China
| | - Pengpeng Chen
- Department of Emergency Medicine, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, 310016, China.
| |
Collapse
|