1
Zirak B, Naghipourfar M, Saberi A, Pouyabahar D, Zarezadeh A, Luo L, Fish L, Huh D, Navickas A, Sharifi-Zarchi A, Goodarzi H. Revealing the grammar of small RNA secretion using interpretable machine learning. Cell Genom 2024; 4:100522. [PMID: 38460515 PMCID: PMC11019361 DOI: 10.1016/j.xgen.2024.100522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 11/02/2023] [Accepted: 02/12/2024] [Indexed: 03/11/2024]
Abstract
Small non-coding RNAs can be secreted through a variety of mechanisms, including exosomal sorting, in small extracellular vesicles, and within lipoprotein complexes. However, the mechanisms that govern their sorting and secretion are not well understood. Here, we present ExoGRU, a machine learning model that predicts small RNA secretion probabilities from primary RNA sequences. We experimentally validated the performance of this model through ExoGRU-guided mutagenesis and synthetic RNA sequence analysis. Additionally, we used ExoGRU to reveal cis and trans factors that underlie small RNA secretion, including known and novel RNA-binding proteins (RBPs), e.g., YBX1, HNRNPA2B1, and RBM24. We also developed a novel technique called exoCLIP, which reveals the RNA interactome of RBPs within the cell-free space. Together, our results demonstrate the power of machine learning in revealing novel biological mechanisms. In addition to providing deeper insight into small RNA secretion, this knowledge can be leveraged in therapeutic and synthetic biology applications.
Affiliation(s)
- Bahar Zirak
  - Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Mohsen Naghipourfar
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Ali Saberi
  - Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 0E9, Canada; McGill Genome Centre, Victor Phillip Dahdaleh Institute of Genomic Medicine, 740 Dr Penfield Avenue, Montreal, QC H3A 0G1, Canada
- Delaram Pouyabahar
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Amirhossein Zarezadeh
  - Department of Stem Cells and Developmental Biology, Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran; Department of Developmental Biology, School of Basic Sciences and Advanced Technologies in Biology, University of Science and Culture, Tehran, Iran
- Lixi Luo
  - Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA; Department of Surgical Oncology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Lisa Fish
  - Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Doowon Huh
  - Laboratory of Systems Cancer Biology, The Rockefeller University, New York, NY, USA
- Albertas Navickas
  - Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA; Institut Curie, CNRS UMR3348, INSERM U1278, Orsay, France
- Ali Sharifi-Zarchi
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Hani Goodarzi
  - Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
2
Lotfollahi M, Klimovskaia Susmelj A, De Donno C, Hetzel L, Ji Y, Ibarra IL, Srivatsan SR, Naghipourfar M, Daza RM, Martin B, Shendure J, McFaline-Figueroa JL, Boyeau P, Wolf FA, Yakubova N, Günnemann S, Trapnell C, Lopez-Paz D, Theis FJ. Predicting cellular responses to complex perturbations in high-throughput screens. Mol Syst Biol 2023:e11517. [PMID: 37154091 DOI: 10.15252/msb.202211517] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 12/21/2022] [Revised: 03/23/2023] [Accepted: 03/31/2023] [Indexed: 05/10/2023]
Abstract
Recent advances in multiplexed single-cell transcriptomics experiments facilitate the high-throughput study of drug and genetic perturbations. However, an exhaustive exploration of the combinatorial perturbation space is experimentally unfeasible. Therefore, computational methods are needed to predict, interpret, and prioritize perturbations. Here, we present the compositional perturbation autoencoder (CPA), which combines the interpretability of linear models with the flexibility of deep-learning approaches for single-cell response modeling. CPA learns to in silico predict transcriptional perturbation response at the single-cell level for unseen dosages, cell types, time points, and species. Using newly generated single-cell drug combination data, we validate that CPA can predict unseen drug combinations while outperforming baseline models. Additionally, the architecture's modularity enables incorporating the chemical representation of the drugs, allowing the prediction of cellular response to completely unseen drugs. Furthermore, CPA is also applicable to genetic combinatorial screens. We demonstrate this by imputing in silico 5,329 missing combinations (97.6% of all possibilities) in a single-cell Perturb-seq experiment with diverse genetic interactions. We envision CPA will facilitate efficient experimental design and hypothesis generation by enabling in silico response prediction at the single-cell level and thus accelerate therapeutic applications using single-cell technologies.
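The imputation figures quoted in the abstract (5,329 missing combinations, 97.6% of all possibilities) can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes that "all possibilities" means unordered pairs drawn from n perturbed genes, which the abstract does not state, and the helper name `candidate_totals` is hypothetical.

```python
from math import ceil, comb, floor

MISSING = 5329       # imputed combinations, as quoted in the abstract
PCT_MISSING = 97.6   # share of all possibilities, rounded to one decimal


def candidate_totals(missing, pct, half_width=0.05):
    """Integer totals consistent with a percentage rounded to one decimal."""
    lo = missing / ((pct + half_width) / 100)
    hi = missing / ((pct - half_width) / 100)
    return range(ceil(lo), floor(hi) + 1)


# Assumption (not in the abstract): possibilities are unordered gene pairs,
# so the total should equal C(n, 2) for some number of genes n.
pair_counts = {comb(n, 2): n for n in range(2, 500)}

matches = [(total, pair_counts[total])
           for total in candidate_totals(MISSING, PCT_MISSING)
           if total in pair_counts]

for total, n_genes in matches:
    measured = total - MISSING
    print(f"total={total}, genes={n_genes}, measured pairs={measured}")
```

Under that assumption, the only total compatible with the rounded percentage is 5,460 pairs, which corresponds to 105 genes and leaves 131 combinations as actually measured. These derived numbers are an inference from the abstract, not values reported in it.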
Affiliation(s)
- Mohammad Lotfollahi
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
  - Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Carlo De Donno
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Leon Hetzel
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
- Yuge Ji
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Ignacio L Ibarra
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- Sanjay R Srivatsan
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Riza M Daza
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Beth Martin
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Jay Shendure
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Howard Hughes Medical Institute, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
  - Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA
- Pierre Boyeau
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- F Alexander Wolf
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- Stephan Günnemann
  - Department of Computer Science, Technical University of Munich, Munich, Germany
- Cole Trapnell
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
  - Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA
- David Lopez-Paz
  - Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Fabian J Theis
  - Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
  - Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
3
Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Büttner M, Wagenstetter M, Avsec Ž, Gayoso A, Yosef N, Interlandi M, Rybakov S, Misharin AV, Theis FJ. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 2022; 40:121-130. [PMID: 34462589 PMCID: PMC8763644 DOI: 10.1038/s41587-021-01001-7] [Citation(s) in RCA: 147] [Impact Index Per Article: 73.5] [Received: 07/30/2020] [Accepted: 06/28/2021] [Indexed: 02/07/2023]
Abstract
Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.
Affiliation(s)
- Mohammad Lotfollahi
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Mohsen Naghipourfar
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Malte D Luecken
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Matin Khajavi
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Maren Büttner
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Marco Wagenstetter
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Žiga Avsec
  - Department of Computer Science, Technical University of Munich, Munich, Germany
- Adam Gayoso
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Nir Yosef
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA
- Marta Interlandi
  - Institute of Medical Informatics, University of Münster, Münster, Germany
- Sergei Rybakov
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
- Alexander V Misharin
  - Division of Pulmonary and Critical Care Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Fabian J Theis
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
4
Abstract
Motivation
While generative models have shown great success in generating high-dimensional samples conditional on low-dimensional descriptors (stroke thickness in MNIST, hair color in CelebA, speaker identity in WaveNet), out-of-distribution generation poses fundamental problems due to the difficulty of learning a compact joint distribution across conditions. The canonical example of the conditional variational autoencoder (CVAE), for instance, does not explicitly relate conditions during training and, hence, has no explicit incentive to learn such a compact representation.
Results
We overcome the limitation of the CVAE by matching distributions across conditions using maximum mean discrepancy in the decoder layer that follows the bottleneck. This introduces a strong regularization both for reconstructing samples within the same condition and for transforming samples across conditions, resulting in much improved generalization. As this amounts to solving a style-transfer problem, we refer to the model as transfer VAE (trVAE). Benchmarking trVAE on high-dimensional image and single-cell RNA-seq data, we demonstrate higher robustness and higher accuracy than existing approaches. We also show qualitatively improved predictions by tackling previously problematic minority classes and multiple conditions in the context of cellular perturbation response to treatment and disease based on high-dimensional single-cell gene expression data. For generic tasks, we improve Pearson correlations of high-dimensional estimated means and variances with their ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively. We further demonstrate that trVAE learns cell-type-specific responses after perturbation and improves the prediction of most cell-type-specific genes by 65%.
Availability and implementation
The trVAE implementation is available via github.com/theislab/trvae. The results of this article can be reproduced via github.com/theislab/trvae_reproducibility.
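For readers unfamiliar with the maximum mean discrepancy (MMD) mentioned in the Results paragraph, the following is a minimal NumPy sketch of a biased squared-MMD estimate between two batches of activations. It is not the trVAE implementation: the single RBF kernel, the `gamma` bandwidth, and the function names `rbf_kernel` and `mmd2` are assumptions made for illustration only.

```python
import numpy as np


def rbf_kernel(x, y, gamma):
    """Pairwise RBF kernel exp(-gamma * ||x_i - y_j||^2) between rows of x and y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y.

    Assumption: a single RBF kernel with fixed bandwidth; the abstract does
    not say which kernel trVAE uses.
    """
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    same_a = rng.normal(size=(200, 5))
    same_b = rng.normal(size=(200, 5))
    shifted = rng.normal(loc=2.0, size=(200, 5))
    print("same distribution:   ", mmd2(same_a, same_b))
    print("shifted distribution:", mmd2(same_a, shifted))
```

In a model of the kind the abstract describes, a term like this would be computed on first-decoder-layer activations grouped by condition and added to the training loss, so that activations from different conditions are pulled toward a common distribution. How the term is weighted against the reconstruction loss is not specified in the abstract.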
Affiliation(s)
- Mohammad Lotfollahi
  - Institute of Computational Biology, Helmholtz Center Munich, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Mohsen Naghipourfar
  - Institute of Computational Biology, Helmholtz Center Munich, Neuherberg, Germany
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Center Munich, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
  - Department of Mathematics, Technische Universität München, Munich, Germany
- F Alexander Wolf
  - Institute of Computational Biology, Helmholtz Center Munich, Neuherberg, Germany