1
Hu M, Chikina M. Heterogeneous pseudobulk simulation enables realistic benchmarking of cell-type deconvolution methods. Genome Biol 2024; 25:169. [PMID: 38956606 PMCID: PMC11218230 DOI: 10.1186/s13059-024-03292-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Accepted: 05/29/2024] [Indexed: 07/04/2024] Open
Abstract
BACKGROUND Computational cell type deconvolution enables the estimation of cell type abundance from bulk tissues and is important for understanding the tissue microenvironment, especially in tumor tissues. With the rapid development of deconvolution methods, many benchmarking studies have been published aiming at a comprehensive evaluation of these methods. Benchmarking studies rely on cell-type resolved single-cell RNA-seq data to create simulated pseudobulk datasets by adding individual cell types in controlled proportions. RESULTS In our work, we show that the standard application of this approach, which uses randomly selected single cells regardless of the intrinsic differences between them, generates synthetic bulk expression values that lack appropriate biological variance. We demonstrate why and how the current bulk simulation pipeline with random cells is unrealistic and propose a heterogeneous simulation strategy as a solution. The heterogeneously simulated bulk samples match the variance observed in real bulk datasets and therefore provide concrete benefits for benchmarking in several ways. We demonstrate that conceptual classes of deconvolution methods differ dramatically in their robustness to heterogeneity, with reference-free methods performing particularly poorly. For regression-based methods, the heterogeneous simulation provides an explicit framework to disentangle the contributions of reference construction and regression methods to performance. Finally, we perform an extensive benchmark of diverse methods across eight different datasets and find BayesPrism and a hybrid MuSiC/CIBERSORTx approach to be the top performers. CONCLUSIONS Our heterogeneous bulk simulation method and the entire benchmarking framework are implemented in a user-friendly package available at https://github.com/humengying0907/deconvBenchmarking and https://doi.org/10.5281/zenodo.8206516, enabling further developments in deconvolution methods.
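The "random cells" pseudobulk construction that the abstract critiques can be illustrated with a minimal sketch. This is not the authors' deconvBenchmarking code; the function name `simulate_pseudobulk`, its parameters, and the toy data are hypothetical, and the sketch only shows the standard step of summing randomly drawn cells of each type in controlled proportions.

```python
import random

def simulate_pseudobulk(expr, cell_types, proportions, n_cells=500, seed=None):
    """Sum randomly drawn single-cell profiles into one pseudobulk sample.

    expr        : list of per-cell expression vectors (cells x genes)
    cell_types  : list of cell-type labels, one per cell
    proportions : dict mapping cell type -> target fraction (should sum to 1)
    """
    rng = random.Random(seed)
    n_genes = len(expr[0])
    bulk = [0.0] * n_genes
    for ctype, frac in proportions.items():
        # Candidate cells of this type, pooled regardless of donor or cell state.
        pool = [i for i, t in enumerate(cell_types) if t == ctype]
        k = round(frac * n_cells)  # number of cells to draw for this type
        for i in rng.choices(pool, k=k):  # sampling with replacement
            for g in range(n_genes):
                bulk[g] += expr[i][g]
    return bulk

# Toy data: type A expresses only gene 0, type B only gene 1.
toy_expr = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_types = ["A", "A", "B", "B"]
toy_bulk = simulate_pseudobulk(toy_expr, toy_types, {"A": 0.25, "B": 0.75},
                               n_cells=100, seed=0)
```

Because every draw comes from the full pool of a cell type, repeated samples differ only by sampling noise, which is one way to see why such pseudobulks can lack biological variance. How the heterogeneous strategy constrains the draw is not specified in the abstract and is left out here.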
Affiliation(s)
- Mengying Hu
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, USA
  - Joint Carnegie Mellon - University of Pittsburgh Computational Biology PhD Program, University of Pittsburgh, Pittsburgh, USA
- Maria Chikina
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, USA
  - Joint Carnegie Mellon - University of Pittsburgh Computational Biology PhD Program, University of Pittsburgh, Pittsburgh, USA
2
Bollon J, Shortreed MR, Jordan BT, Miller R, Jeffery E, Cavalli A, Smith LM, Dewey C, Sheynkman GM, Tiberi S. IsoBayes: a Bayesian approach for single-isoform proteomics inference. bioRxiv 2024:2024.06.10.598223. [PMID: 38915658 PMCID: PMC11195044 DOI: 10.1101/2024.06.10.598223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Studying protein isoforms is an essential step in biomedical research; at present, the main approach for analyzing proteins is bottom-up mass spectrometry proteomics, which returns peptide identifications that are indirectly used to infer the presence of protein isoforms. However, the detection and quantification processes are noisy; in particular, peptides may be erroneously detected, and most peptides, known as shared peptides, are associated with multiple protein isoforms. As a consequence, studying individual protein isoforms is challenging, and inferred protein results are often abstracted to the gene level or to groups of protein isoforms. Here, we introduce IsoBayes, a novel statistical method to perform inference at the isoform level. Our method enhances the information available by integrating mass spectrometry proteomics and transcriptomics data in a Bayesian probabilistic framework. To account for the uncertainty in the measurement process, we propose a two-layer latent variable approach: first, we sample whether a peptide has been correctly detected (or, alternatively, filter peptides); second, we allocate the abundance of such selected peptides across the protein(s) they are compatible with. This enables us, starting from peptide-level data, to recover protein-level data; in particular, we: i) infer the presence/absence of each protein isoform (via a posterior probability), ii) estimate its abundance (and credible interval), and iii) target isoforms where transcript and protein relative abundances significantly differ. We benchmarked our approach in simulations and in two multi-protease real datasets: our method displays good sensitivity and specificity when detecting protein isoforms, its estimated abundances highly correlate with the ground truth, and it can detect changes between protein and transcript relative abundances. IsoBayes is freely distributed as a Bioconductor R package, and is accompanied by an example usage vignette.
Affiliation(s)
- Jordy Bollon
  - Computational and Chemical Biology, Italian Institute of Technology, CMPVdA, Aosta, Italy
  - Astronomical Observatory of the Autonomous Region of the Aosta Valley (OAVdA), Nus, Italy
- Ben T Jordan
  - Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Rachel Miller
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA
- Erin Jeffery
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Andrea Cavalli
  - Computational and Chemical Biology, Italian Institute of Technology, CMPVdA, Aosta, Italy
  - Centre Européen de Calcul Atomique et Moléculaire, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Lloyd M Smith
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA
- Colin Dewey
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- Gloria M Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Simone Tiberi
  - Department of Statistical Sciences, University of Bologna, Bologna, Italy
3
Ornelas MY, Ouyang WO, Wu NC. A library-on-library screen reveals the breadth expansion landscape of a broadly neutralizing betacoronavirus antibody. bioRxiv 2024:2024.06.06.597810. [PMID: 38915656 PMCID: PMC11195093 DOI: 10.1101/2024.06.06.597810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Broadly neutralizing antibodies (bnAbs) typically evolve cross-reactivity breadth through acquiring somatic hypermutations. While evolution of breadth requires improvement of binding to multiple antigenic variants, most experimental evolution platforms select against only one antigenic variant at a time. In this study, a yeast display library-on-library approach was applied to delineate the affinity maturation of a betacoronavirus bnAb, S2P6, against 27 spike stem helix peptides in a single experiment. Our results revealed that the binding affinity landscape of S2P6 varies among different stem helix peptides. However, somatic hypermutations that confer general improvement in binding affinity across different stem helix peptides could also be identified. We further showed that a key somatic hypermutation for breadth expansion involves long-range interaction. Overall, our work not only provides a proof-of-concept for using a library-on-library approach to analyze the evolution of antibody breadth, but also has important implications for the development of broadly protective vaccines.
Affiliation(s)
- Marya Y. Ornelas
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Wenhao O. Ouyang
  - Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Nicholas C. Wu
  - Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
4
Salcedo-Tacuma D, Howells G, Mchose C, Gutierrez-Diaz A, Schupp J, Smith DM. ProEnd: A Comprehensive Database for Identifying HbYX Motif-Containing Proteins Across the Tree of Life. bioRxiv 2024:2024.06.08.598080. [PMID: 38895466 PMCID: PMC11185799 DOI: 10.1101/2024.06.08.598080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
The proteasome plays a crucial role in cellular homeostasis by degrading misfolded, damaged, or unnecessary proteins. Understanding the regulatory mechanisms of proteasome activity is vital, particularly the interaction with activators containing the hydrophobic-tyrosine-any amino acid (HbYX) motif. Here, we present ProEnd, a comprehensive database designed to identify and catalog HbYX motif-containing proteins across the tree of life. Using a simple bioinformatics pipeline, we analyzed approximately 73 million proteins from 22,000 reference proteomes in the UniProt/SwissProt database. Our findings reveal the widespread presence of HbYX motifs in diverse organisms, highlighting their evolutionary conservation and functional significance. Notably, we observed an interesting prevalence of these motifs in viral proteomes, suggesting strategic interactions with the host proteasome. As validation two novel HbYX proteins found in this database were tested and found to directly interact with the proteasome. ProEnd's extensive dataset and user-friendly interface enable researchers to explore the potential proteasomal regulator landscape, generating new hypotheses to advance proteasome biology. This resource is set to facilitate the discovery of novel therapeutic targets, enhancing our approach to treating diseases such as neurodegenerative disorders and cancer. Link: http://proend.org/.
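As a rough illustration of what a "simple bioinformatics pipeline" for this motif might check, the sketch below tests whether a protein sequence ends in a hydrophobic residue, a tyrosine, and any amino acid. It assumes the motif sits at the C-terminus, as HbYX is generally described, and the hydrophobic residue set is an assumption rather than ProEnd's definition; the function name `has_cterminal_hbyx` is hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Assumption: one plausible hydrophobic set; the database may define Hb differently.
HYDROPHOBIC = set("AVLIMFWC")

def has_cterminal_hbyx(sequence, hydrophobic=HYDROPHOBIC):
    """Return True if the last three residues match hydrophobic-Tyr-any."""
    tail = sequence.strip().upper()[-3:]
    if len(tail) < 3:
        return False
    hb, tyr, any_aa = tail
    return hb in hydrophobic and tyr == "Y" and any_aa in AMINO_ACIDS

# Toy sequences (not real proteins).
print(has_cterminal_hbyx("MKTLYR"))  # tail LYR
print(has_cterminal_hbyx("MKTKYR"))  # tail KYR
```

Passing a different `hydrophobic` set makes it easy to see how sensitive the hit count is to that definition.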
Affiliation(s)
- David Salcedo-Tacuma
  - Department of Biochemistry and Molecular Medicine, West Virginia University School of Medicine, 4 Medical Center Dr., Morgantown, WV USA
- Giovanni Howells
  - Department of Biochemistry and Molecular Medicine, West Virginia University School of Medicine, 4 Medical Center Dr., Morgantown, WV USA
- Coleman Mchose
  - Department of Biochemistry and Molecular Medicine, West Virginia University School of Medicine, 4 Medical Center Dr., Morgantown, WV USA
- Aimer Gutierrez-Diaz
  - Department of Plant Biology, Uppsala BioCenter, Swedish University of Agricultural Sciences, Uppsala 75007, Sweden
- Jane Schupp
  - Department of Biochemistry and Molecular Medicine, West Virginia University School of Medicine, 4 Medical Center Dr., Morgantown, WV USA
- David M. Smith
  - Department of Biochemistry and Molecular Medicine, West Virginia University School of Medicine, 4 Medical Center Dr., Morgantown, WV USA
  - Department of Neuroscience, Rockefeller Neuroscience Institute, West Virginia University, Morgantown, West Virginia, USA
5
Li X, Chen K, Shao M. Efficient Seeding for Error-Prone Sequences with SubseqHash2. bioRxiv 2024:2024.05.30.596711. [PMID: 38895288 PMCID: PMC11185578 DOI: 10.1101/2024.05.30.596711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Seeding is an essential preparatory step for large-scale sequence comparisons. Substring-based seeding methods such as kmers are ideal for sequences with low error rates but struggle to achieve high sensitivity while maintaining a reasonable precision for error-prone long reads. SubseqHash, a novel subsequence-based seeding method we recently developed, achieves superior accuracy to substring-based methods in seeding sequences with high mutation/error rates, while the only drawback is its computation speed. In this paper, we propose SubseqHash2, an improved algorithm that can compute multiple sets of seeds in one run by defining k orders over all length- k subsequences and identifying the optimal subsequence under each of the k orders in a single dynamic programming framework. The algorithm is further accelerated using SIMD instructions. SubseqHash2 achieves a 10-50× speedup over repeating SubseqHash while maintaining the high accuracy of seeds. We demonstrate that SubseqHash2 drastically outperforms popular substring-based methods including kmers, minimizers, syncmers, and Strobemers for three fundamental applications. In read mapping, SubseqHash2 can generate adequate seed-matches for aligning hard reads that minimap2 fails on. In sequence alignment, SubseqHash2 achieves high coverage of correct seeds and low coverage of incorrect seeds. In overlap detection, seeds produced by SubseqHash2 lead to more correct overlapping pairs at the same false-positive rate. With all the algorithmic breakthroughs of SubseqHash2, we clear the path for the wide adoption of subsequence-based seeds in long-read analysis. SubseqHash2 is available at https://github.com/Shao-Group/SubseqHash2.
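The sensitivity problem of substring-based seeds mentioned above can be made concrete with a small sketch. It does not implement SubseqHash or SubseqHash2; it only counts exact k-mer matches between a toy sequence and a copy with one substitution, and the helper names `kmers` and `shared_kmers` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmers(a, b, k):
    """Count k-mers present in both sequences (exact seed matches)."""
    return len(kmers(a, k) & kmers(b, k))

reference = "ACGTTGCAAGCT"
read = "ACGTTCCAAGCT"  # one substitution at position 5 (G -> C)

total = len(kmers(reference, 4))
surviving = shared_kmers(reference, read, 4)
print(total, surviving)
```

A single substitution removes every k-mer that overlaps it, so with higher error rates few exact k-mer seeds survive; subsequence-based seeds are designed to tolerate such edits.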
Affiliation(s)
- Xiang Li
  - Department of Computer Science and Engineering, The Pennsylvania State University, United States
- Ke Chen
  - Department of Computer Science and Engineering, The Pennsylvania State University, United States
- Mingfu Shao
  - Department of Computer Science and Engineering, The Pennsylvania State University, United States
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, United States
6
Duo H, Li Y, Lan Y, Tao J, Yang Q, Xiao Y, Sun J, Li L, Nie X, Zhang X, Liang G, Liu M, Hao Y, Li B. Systematic evaluation with practical guidelines for single-cell and spatially resolved transcriptomics data simulation under multiple scenarios. Genome Biol 2024; 25:145. [PMID: 38831386 PMCID: PMC11149245 DOI: 10.1186/s13059-024-03290-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advancements in life sciences. To develop bioinformatics tools for scRNA-seq and SRT data and perform unbiased benchmarks, data simulation has been widely adopted by providing explicit ground truth and generating customized datasets. However, the performance of simulation methods under multiple scenarios has not been comprehensively assessed, making it challenging to choose suitable methods without practical guidelines. RESULTS We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 have the best accuracy performance across various platforms. Unexpectedly, some methods tailored to scRNA-seq data have potential compatibility for simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods under corresponding simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores but they cannot generate realistic simulated data. Users should consider the trade-offs between method accuracy and scalability (or functionality) when making decisions. Additionally, execution errors are mainly caused by failed parameter estimations and appearance of missing or infinite values in calculations. We provide practical guidelines for method selection, a standard pipeline Simpipe ( https://github.com/duohongrui/simpipe ; https://doi.org/10.5281/zenodo.11178409 ), and an online tool Simsite ( https://www.ciblab.net/software/simshiny/ ) for data simulation. CONCLUSIONS No method performs best on all criteria, thus a good-yet-not-the-best method is recommended if it solves problems effectively and reasonably. 
Our comprehensive work provides crucial insights for developers on modeling gene expression data and fosters the simulation process for users.
Affiliation(s)
- Hongrui Duo
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Yinghong Li
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, People's Republic of China
- Yang Lan
  - Institute of Pathology and Southwest Cancer Center, Southwest Hospital, Army Medical University, Chongqing, 400038, People's Republic of China
- Jingxin Tao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Qingxia Yang
  - Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310058, People's Republic of China
- Yingxue Xiao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Jing Sun
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Lei Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Xiner Nie
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
- Xiaoxi Zhang
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Guizhao Liang
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
- Mingwei Liu
  - Key Laboratory of Clinical Laboratory Diagnostics, College of Laboratory Medicine, Chongqing Medical University, Chongqing, 400016, People's Republic of China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
7
Rivero-Garcia I, Torres M, Sánchez-Cabo F. Deep generative models in single-cell omics. Comput Biol Med 2024; 176:108561. [PMID: 38749321 DOI: 10.1016/j.compbiomed.2024.108561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 04/30/2024] [Accepted: 05/05/2024] [Indexed: 05/31/2024]
Abstract
Deep Generative Models (DGMs) are becoming instrumental for inferring probability distributions inherent to complex processes, such as most questions in biomedical research. For many years, there was a lack of mathematical methods that would allow this inference in the scarce data scenario of biomedical research. The advent of single-cell omics has finally made square the so-called "skinny matrix", allowing to apply mathematical methods already extensively used in other areas. Moreover, it is now possible to integrate data at different molecular levels in thousands or even millions of samples, thanks to the number of single-cell atlases being collaboratively generated. Additionally, DGMs have proven useful in other frequent tasks in single-cell analysis pipelines, from dimensionality reduction, cell type annotation to RNA velocity inference. In spite of its promise, DGMs need to be used with caution in biomedical research, paying special attention to its use to answer the right questions and the definition of appropriate error metrics and validation check points that confirm not only its correct use but also its relevance. All in all, DGMs provide an exciting tool that opens a bright future for the integrative analysis of single-cell -omics to understand health and disease.
Affiliation(s)
- Inés Rivero-Garcia
  - Universidad Politécnica de Madrid, Madrid, 28040, Spain
  - Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, 28029, Spain
- Miguel Torres
  - Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, 28029, Spain
- Fátima Sánchez-Cabo
  - Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, 28029, Spain
8
Cottrell S, Hozumi Y, Wei GW. K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis. Comput Biol Med 2024; 175:108497. [PMID: 38678944 PMCID: PMC11090715 DOI: 10.1016/j.compbiomed.2024.108497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 04/08/2024] [Accepted: 04/21/2024] [Indexed: 05/01/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. 
For example, tPCA provides up to 628%, 78%, and 149% improvements to UMAP, tSNE, and NMF, respectively on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements to UMAP, tSNE, and NMF, respectively on clustering in the ARI metric.
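The idea of inducing a filtration by varying the number of neighbors, rather than a distance threshold, can be sketched as a nested sequence of kNN graphs. This is only the graph-construction step under simple assumptions (Euclidean distance, ties broken by index, undirected edges); it does not compute persistent Laplacians or tPCA, and the function names are hypothetical.

```python
import math

def knn_edges(points, k):
    """Undirected edge set of the k-nearest-neighbor graph on points."""
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by distance; the index breaks ties deterministically.
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: (math.dist(p, points[j]), j))
        for j in order[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def knn_filtration(points, max_k):
    """Edge sets for k = 1..max_k; each step contains the previous one."""
    return [knn_edges(points, k) for k in range(1, max_k + 1)]

toy_points = [(0.0,), (1.0,), (3.0,), (7.0,)]
steps = knn_filtration(toy_points, 3)
```

Each point's first k neighbors are a prefix of its first k+1 neighbors, so the edge sets grow monotonically with k, which is the property a filtration needs.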
Affiliation(s)
- Sean Cottrell
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Yuta Hozumi
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
9
Bolteau M, Chebouba L, David L, Bourdon J, Guziolowski C. Boolean Network Models of Human Preimplantation Development. J Comput Biol 2024; 31:513-523. [PMID: 38814745 DOI: 10.1089/cmb.2024.0517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024] Open
Abstract
Single-cell transcriptomic studies of differentiating systems allow meaningful understanding, especially in human embryonic development and cell fate determination. We present an innovative method aimed at modeling these intricate processes by leveraging scRNAseq data from various human developmental stages. Our implemented method identifies pseudo-perturbations, since actual perturbations are unavailable due to ethical and technical constraints. By integrating these pseudo-perturbations with prior knowledge of gene interactions, our framework generates stage-specific Boolean networks (BNs). We apply our method to medium and late trophectoderm developmental stages and identify 20 pseudo-perturbations required to infer BNs. The resulting BN families delineate distinct regulatory mechanisms, enabling the differentiation between these developmental stages. We show that our program outperforms an existing pseudo-perturbation identification tool. Our framework contributes to comprehending human developmental processes and holds potential applicability to diverse developmental stages and other research scenarios.
Affiliation(s)
- Mathieu Bolteau
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000, Nantes, France
- Lokmane Chebouba
  - Department of Electronics, University of Frères Mentouri Constantine 1, Constantine, Algeria
  - LRIA Laboratory, University of Science and Technology Houari Boumediene (USTHB), Bab-Ezzouar, Algeria
- Laurent David
  - Nantes Université, CHU Nantes, INSERM, Center for Research in Transplantation and Translational Immunology, UMR 1064, F-44000, Nantes, France
- Jérémie Bourdon
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000, Nantes, France
- Carito Guziolowski
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000, Nantes, France
10
Zanfardino M, Franzese M, Geraci F. DeClUt: Decluttering differentially expressed genes through clustering of their expression profiles. Comput Methods Programs Biomed 2024; 254:108258. [PMID: 38851122 DOI: 10.1016/j.cmpb.2024.108258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 04/26/2024] [Accepted: 05/29/2024] [Indexed: 06/10/2024]
Abstract
BACKGROUND AND OBJECTIVE Differential expression analysis is one of the most popular activities in transcriptomic studies based on next-generation sequencing technologies. In fact, differentially expressed genes (DEGs) between two conditions represent ideal prognostic and diagnostic candidate biomarkers for many pathologies. As a result, several algorithms, such as DESeq2 and edgeR, have been developed to identify DEGs. Despite their widespread use, there is no consensus on which model performs best for different types of data, and many existing methods suffer from high False Discovery Rates (FDR). METHODS We present a new algorithm, DeClUt, based on the intuition that the expression profile of differentially expressed genes should form two reasonably compact and well-separated clusters. This, in turn, implies that the bipartition induced by the two conditions being compared should overlap with the clustering. The clustering algorithm underlying DeClUt was designed to be robust to outliers typical of RNA-seq data. In particular, we used the average silhouette function to enforce membership assignment of samples to the most appropriate condition. RESULTS DeClUt was tested on real RNA-seq datasets and benchmarked against four of the most widely used methods (edgeR, DESeq2, NOISeq, and SAMseq). Experiments showed a higher self-consistency of results than the competitors as well as a significantly lower False Positive Rate (FPR). Moreover, tested on a real prostate cancer RNA-seq dataset, DeClUt highlighted 8 DE genes, linked to neoplastic processes according to the DisGeNET database, that none of the other methods had identified. CONCLUSIONS Our work presents a novel algorithm that builds upon basic concepts of data clustering and exhibits greater consistency and a significantly lower False Positive Rate than state-of-the-art methods. Additionally, DeClUt is able to highlight relevant differentially expressed genes not identified by other tools, contributing to improved efficacy of differential expression analyses in various biological applications.
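The intuition that a differentially expressed gene should split into two compact, well-separated clusters matching the conditions can be illustrated with the average silhouette of the condition labels on one gene's expression values. This is a generic silhouette computation, not DeClUt's clustering algorithm or its outlier handling; the function name and toy values are hypothetical.

```python
def average_silhouette(values, labels):
    """Mean silhouette of 1-D values under a two-group labelling."""
    scores = []
    for i, (v, lab) in enumerate(zip(values, labels)):
        # Distances to the other members of the same group (excluding itself).
        same = [abs(v - w) for j, (w, l) in enumerate(zip(values, labels))
                if l == lab and j != i]
        # Distances to all members of the other group.
        other = [abs(v - w) for w, l in zip(values, labels) if l != lab]
        if not same or not other:
            scores.append(0.0)  # convention for singleton or missing groups
            continue
        a = sum(same) / len(same)    # mean within-group distance
        b = sum(other) / len(other)  # mean distance to the other group
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

expression = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
separated = average_silhouette(expression, ["ctrl"] * 3 + ["case"] * 3)
mixed = average_silhouette(expression, ["ctrl", "case"] * 3)
```

A value near 1 means the condition labels coincide with two tight clusters, while a value near or below 0 means the labels cut across the natural grouping.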
Affiliation(s)
- Monica Franzese
  - IRCCS Synlab SDN, Via E. Gianturco, 113, Naples, 80143, Italy
- Filippo Geraci
  - Institute for Informatics and Telematics, CNR, Via G. Moruzzi 1, Pisa, 56124, Italy
11
Abegaz F, Abedini D, White F, Guerrieri A, Zancarini A, Dong L, Westerhuis JA, van Eeuwijk F, Bouwmeester H, Smilde AK. A strategy for differential abundance analysis of sparse microbiome data with group-wise structured zeros. Sci Rep 2024; 14:12433. [PMID: 38816496 PMCID: PMC11139916 DOI: 10.1038/s41598-024-62437-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 05/16/2024] [Indexed: 06/01/2024] Open
Abstract
Comparing the abundance of microbial communities between different groups or obtained under different experimental conditions using count sequence data is a challenging task due to various issues such as inflated zero counts, overdispersion, and non-normality. Several methods and procedures based on counts, their transformation and compositionality have been proposed in the literature to detect differentially abundant species in datasets containing hundreds to thousands of microbial species. Despite efforts to address the large numbers of zeros present in microbiome datasets, even after careful data preprocessing, the performance of existing methods is impaired by the presence of inflated zero counts and group-wise structured zeros (i.e. all zero counts in a group). We propose and validate using extensive simulations an approach combining two differential abundance testing methods, namely DESeq2-ZINBWaVE and DESeq2, to address the issues of zero-inflation and group-wise structured zeros, respectively. This combined approach was subsequently successfully applied to two plant microbiome datasets that revealed a number of taxa as interesting candidates for further experimental validation.
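The notion of group-wise structured zeros (all counts zero within one group) can be made concrete with a small sketch that flags such taxa before testing. This is an illustration only, not the authors' pipeline: the function name and toy counts are hypothetical, counts are assumed non-negative, and how flagged taxa are then routed to DESeq2 versus DESeq2-ZINBWaVE is described only at the level of the abstract.

```python
def groupwise_structured_zeros(counts, groups):
    """Map each taxon to the groups in which all of its counts are zero.

    counts : dict taxon -> list of non-negative counts, one per sample
    groups : list of group labels, one per sample (same order as counts)
    Taxa that are zero in every group are skipped, since they carry no signal.
    """
    flagged = {}
    for taxon, row in counts.items():
        totals = {}
        for count, group in zip(row, groups):
            totals[group] = totals.get(group, 0) + count
        zero_groups = sorted(g for g, total in totals.items() if total == 0)
        if zero_groups and len(zero_groups) < len(totals):
            flagged[taxon] = zero_groups
    return flagged

toy_counts = {
    "taxA": [0, 0, 5, 7],  # absent in every control sample
    "taxB": [1, 0, 0, 3],  # zeros scattered across groups
    "taxC": [0, 0, 0, 0],  # absent everywhere
}
toy_groups = ["ctl", "ctl", "trt", "trt"]
flagged = groupwise_structured_zeros(toy_counts, toy_groups)
```

Separating these taxa from those with scattered zeros mirrors the abstract's split between the two issues, zero inflation and group-wise structured zeros.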
Collapse
Affiliation(s)
- Fentaw Abegaz
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands.
- Biometris, Wageningen University & Research, 6708 PB, Wageningen, The Netherlands.
| | - Davar Abedini
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Fred White
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Alessandra Guerrieri
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Anouk Zancarini
- IGEPP, INRAE, Institut Agro, Univ Rennes, 35653, Le Rheu, France
| | - Lemeng Dong
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Johan A Westerhuis
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Fred van Eeuwijk
- Biometris, Wageningen University & Research, 6708 PB, Wageningen, The Netherlands
| | - Harro Bouwmeester
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Age K Smilde
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| |
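The notion of a group-wise structured zero in the abstract above (a taxon whose counts are all zero within one group) can be illustrated with a minimal sketch. This is only the detection step, not the DESeq2-ZINBWaVE/DESeq2 pipeline the authors describe; the function name `structured_zero_taxa` and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def structured_zero_taxa(
    counts: Dict[str, Sequence[int]], groups: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each taxon to the groups in which all of its counts are zero.

    `counts` maps a taxon name to its per-sample counts; `groups` gives the
    group label of each sample, in the same order.
    """
    group_names = sorted(set(groups))
    flagged: Dict[str, List[str]] = {}
    for taxon, row in counts.items():
        if len(row) != len(groups):
            raise ValueError(f"{taxon}: expected {len(groups)} counts, got {len(row)}")
        zero_groups = [
            g
            for g in group_names
            if all(c == 0 for c, label in zip(row, groups) if label == g)
        ]
        # A taxon that is zero in every group carries no between-group
        # contrast, so only taxa absent from some (not all) groups are flagged.
        if zero_groups and len(zero_groups) < len(group_names):
            flagged[taxon] = zero_groups
    return flagged


# Toy example: three control and three treated samples (made-up numbers).
groups = ["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"]
counts = {
    "taxA": [0, 0, 0, 5, 7, 3],   # absent from every control sample
    "taxB": [1, 0, 2, 0, 4, 0],   # sporadic zeros only
    "taxC": [0, 0, 0, 0, 0, 0],   # absent everywhere
}
flagged = structured_zero_taxa(counts, groups)
print(flagged)
```

Taxa flagged this way are the ones the abstract routes to a separate test, since a group with no non-zero observations gives a count model nothing to estimate that group's abundance from.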
|
12
|
Ferriera Neres D, Wright RC. Pleiotropy, a feature or a bug? Toward co-ordinating plant growth, development, and environmental responses through engineering plant hormone signaling. Curr Opin Biotechnol 2024; 88:103151. [PMID: 38823314 DOI: 10.1016/j.copbio.2024.103151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Revised: 05/10/2024] [Accepted: 05/14/2024] [Indexed: 06/03/2024]
Abstract
The advent of gene editing technologies such as CRISPR has simplified co-ordinating trait development. However, identifying candidate genes remains a challenge due to complex gene networks and pathways. These networks exhibit pleiotropy, complicating the determination of specific gene and pathway functions. In this review, we explore how systems biology and single-cell sequencing technologies can aid in identifying candidate genes for co-ordinating specifics of plant growth and development within specific temporal and tissue contexts. Exploring sequence-function space of these candidate genes and pathway modules with synthetic biology allows us to test hypotheses and define genotype-phenotype relationships through reductionist approaches. Collectively, these techniques hold the potential to advance breeding and genetic engineering strategies while also addressing genetic diversity issues critical for adaptation and trait development.
Affiliation(s)
- Deisiany Ferriera Neres
- Biological Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States; Translational Plant Science Center, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States
| | - R Clay Wright
- Biological Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States; Translational Plant Science Center, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States.
| |
|
13
|
Zhao Y, Ansarullah, Kumar P, Mahoney JM, He H, Baker C, George J, Li S. Causal network perturbation analysis identifies known and novel type-2 diabetes driver genes. bioRxiv 2024:2024.05.22.595431. [PMID: 38826370 PMCID: PMC11142180 DOI: 10.1101/2024.05.22.595431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
The molecular pathogenesis of diabetes is multifactorial, involving genetic predisposition and environmental factors that are not yet fully understood. However, pancreatic β-cell failure remains among the primary reasons underlying the progression of type-2 diabetes (T2D), making β-cell dysfunction an attractive target for diabetes treatment. To identify genetic contributors to β-cell dysfunction, we investigated single-cell gene expression changes in β-cells from healthy (C57BL/6J) and diabetic (NZO/HlLtJ) mice fed a normal or high-fat, high-sugar (HFHS) diet. Our study presents an innovative integration of the causal network perturbation assessment (ssNPA) framework with meta-cell transcriptome analysis to explore the genetic underpinnings of T2D. By generating a reference causal network and applying in silico perturbation, we identified novel genes implicated in T2D and validated our candidates using the Knockout Mouse Phenotyping (KOMP) Project database.
Affiliation(s)
- Yue Zhao
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Ansarullah
- Center for Biometric Analysis, The Jackson Laboratory, Bar Harbor, ME, USA
| | - Parveen Kumar
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | | | - Hao He
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Candice Baker
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Joshy George
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Sheng Li
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Department of Genetics and Genome Sciences, University of Connecticut School of Medicine, Farmington, CT, USA
| |
|
14
|
Algavi YM, Borenstein E. Relative dispersion ratios following fecal microbiota transplant elucidate principles governing microbial migration dynamics. Nat Commun 2024; 15:4447. [PMID: 38789466 PMCID: PMC11126695 DOI: 10.1038/s41467-024-48717-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 05/08/2024] [Indexed: 05/26/2024] Open
Abstract
Microorganisms frequently migrate from one ecosystem to another. Yet, despite the potential importance of this process in modulating the environment and the microbial ecosystem, our understanding of the fundamental forces that govern microbial dispersion is still lacking. Moreover, while theoretical models and in-vitro experiments have highlighted the contribution of species interactions to community assembly, identifying such interactions in vivo, specifically in communities as complex as the human gut, remains challenging. To address this gap, here we introduce a robust and rigorous computational framework, termed Relative Dispersion Ratio (RDR) analysis, and leverage data from well-characterized fecal microbiota transplant trials to pinpoint dependencies between taxa during the colonization of the human gastrointestinal tract. Our analysis identifies numerous pairwise dependencies between co-colonizing microbes during migration between gastrointestinal environments. We further demonstrate that the identified dependencies agree with previously reported findings from in-vitro experiments and population-wide distribution patterns. Finally, we explore metabolic dependencies between these taxa and characterize the functional properties that facilitate effective dispersion. Collectively, our findings provide insights into the principles and determinants of community dynamics following ecological translocation, informing potential opportunities for precise community design.
Affiliation(s)
- Yadid M Algavi
- Faculty of Medical & Health Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Elhanan Borenstein
- Faculty of Medical & Health Sciences, Tel Aviv University, Tel Aviv, Israel.
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
- Santa Fe Institute, Santa Fe, NM, USA.
| |
|
15
|
Wang M, Fontaine S, Jiang H, Li G. ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models. bioRxiv 2024:2024.05.14.594186. [PMID: 38798558 PMCID: PMC11118451 DOI: 10.1101/2024.05.14.594186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Microbiome differential abundance analysis remains a challenging problem despite multiple methods proposed in the literature. The excessive zeros and compositionality of metagenomics data are two main challenges for differential abundance analysis. We propose a novel method called "analysis of differential abundance by pooling Tobit models" (ADAPT) to overcome these two challenges. ADAPT uniquely treats zero counts as left-censored observations to facilitate computation and enhance interpretation. ADAPT also encompasses a theoretically justified way of selecting non-differentially abundant microbiome taxa as a reference for hypothesis testing. We generate synthetic data using independent simulation frameworks to show that ADAPT has more consistent false discovery rate control and higher statistical power than competitors. We use ADAPT to analyze 16S rRNA sequencing of saliva samples and shotgun metagenomics sequencing of plaque samples collected from infants in the COHRA2 study. The results provide novel insights into the association between the oral microbiome and early childhood dental caries.
Affiliation(s)
- Mukai Wang
- Department of Biostatistics, University of Michigan, Ann Arbor, 48109, MI, USA
| | - Simon Fontaine
- Department of Statistics, University of Michigan, Ann Arbor, 48109, MI, USA
| | - Hui Jiang
- Department of Biostatistics, University of Michigan, Ann Arbor, 48109, MI, USA
| | - Gen Li
- Department of Biostatistics, University of Michigan, Ann Arbor, 48109, MI, USA
| |
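For readers unfamiliar with the idea of treating zeros as left-censored observations, the textbook Tobit log-likelihood is sketched below. This is the generic model only: the response $y_i$, covariates $x_i$, and censoring threshold $c$ are assumed notation, and the abstract does not say how ADAPT defines them or how it pools models across taxa.

```latex
% Latent response: y_i^* = x_i^\top \beta + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
% Observed response: y_i = \max(y_i^*, c), so a value recorded at c only tells us that y_i^* \le c
\ell(\beta, \sigma)
  = \sum_{i:\, y_i > c} \log\!\left[ \frac{1}{\sigma}\,
      \phi\!\left( \frac{y_i - x_i^\top \beta}{\sigma} \right) \right]
  + \sum_{i:\, y_i = c} \log \Phi\!\left( \frac{c - x_i^\top \beta}{\sigma} \right)
```

Here $\phi$ and $\Phi$ are the standard normal density and distribution function. Uncensored observations contribute a density term, while censored ones contribute only the probability of falling at or below the threshold, which is why zeros need no imputation under this formulation.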
|
16
|
Barão S, Xu Y, Llongueras JP, Vistein R, Goff L, Nielsen K, Bae BI, Smith RS, Walsh CA, Stein-O'Brien G, Müller U. BRN1/2 Function in Neocortical Size Determination and Microcephaly. bioRxiv 2024:2023.11.02.565322. [PMID: 37961182 PMCID: PMC10635068 DOI: 10.1101/2023.11.02.565322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
The mammalian neocortex differs vastly in size and complexity between mammalian species, yet the mechanisms that lead to an increase in brain size during evolution are not known. We show here that two transcription factors coordinate gene expression programs in progenitor cells of the neocortex to regulate their proliferative capacity and neuronal output in order to determine brain size. Comparative studies in mice, ferrets and macaques demonstrate an evolutionarily conserved function for these transcription factors in regulating progenitor behaviors across the mammalian clade. Strikingly, the two transcriptional regulators control the expression of large numbers of genes linked to microcephaly, suggesting that transcriptional deregulation is an important determinant of the molecular pathogenesis of microcephaly. This is consistent with the finding that genetic manipulation of the two transcription factors leads to severe microcephaly.
|
17
|
Cuevas-Diaz Duran R, Wei H, Wu J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics 2024; 25:444. [PMID: 38711017 PMCID: PMC11073985 DOI: 10.1186/s12864-024-10364-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 04/29/2024] [Indexed: 05/08/2024] Open
Abstract
BACKGROUND Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed, addressing different sources of dispersion and making specific assumptions about the count data. MAIN BODY The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this end, we first give an overview of the different single-cell sequencing platforms and methods commonly used, including isolation and library preparation protocols. Next, we discuss the inherent sources of variability in scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. CONCLUSIONS According to the correction performed, normalization methods can be broadly classified as within-sample and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has pros and cons and makes different statistical assumptions. However, no single normalization method performs best. Instead, metrics such as silhouette width, the K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.
Affiliation(s)
- Raquel Cuevas-Diaz Duran
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico.
| | - Haichao Wei
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
| | - Jiaqian Wu
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA.
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, 77030, USA.
| |
|
18
|
Van Deynze K, Mumm C, Maltby CJ, Switzenberg JA, Todd PK, Boyle AP. Enhanced Detection and Genotyping of Disease-Associated Tandem Repeats Using HMMSTR and Targeted Long-Read Sequencing. medRxiv 2024:2024.05.01.24306681. [PMID: 38746091 PMCID: PMC11092683 DOI: 10.1101/2024.05.01.24306681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Tandem repeat sequences comprise approximately 8% of the human genome and are linked to more than 50 neurodegenerative disorders. Accurate characterization of disease-associated repeat loci remains resource intensive and often lacks high-resolution genotype calls. We introduce a multiplexed, targeted nanopore sequencing panel and HMMSTR, a sequence-based tandem repeat copy number caller. HMMSTR outperforms current signal- and sequence-based callers relative to two assemblies, and we show that it performs with high accuracy in heterozygous regions and at low read coverage. The flexible panel allows us to capture disease-associated regions at an average coverage of >150x. Using these tools, we successfully characterize known or suspected repeat expansions in patient-derived samples. In these samples we also identify unexpected expanded alleles at tandem repeat loci not previously associated with the underlying diagnosis. This genotyping approach for tandem repeat expansions is scalable, simple, flexible, and accurate, offering significant potential for diagnostic applications and investigation of expansion co-occurrence in neurodegenerative disorders.
|
19
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
| |
|
20
|
Kim H, Chang W, Chae SJ, Park JE, Seo M, Kim JK. scLENS: data-driven signal detection for unbiased scRNA-seq data analysis. Nat Commun 2024; 15:3575. [PMID: 38678050 DOI: 10.1038/s41467-024-47884-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 04/14/2024] [Indexed: 04/29/2024] Open
Abstract
High dimensionality and noise have limited the new biological insights that can be discovered in scRNA-seq data. While dimensionality reduction tools have been developed to extract biological signals from the data, they often require manual determination of signal dimension, introducing user bias. Furthermore, a common data preprocessing method, log normalization, can unintentionally distort signals in the data. Here, we develop scLENS, a dimensionality reduction tool that circumvents the long-standing issues of signal distortion and manual input. Specifically, we identify the primary cause of signal distortion during log normalization and effectively address it by uniformizing cell vector lengths with L2 normalization. Furthermore, we utilize random matrix theory-based noise filtering and a signal robustness test to enable data-driven determination of the threshold for signal dimensions. Our method outperforms 11 widely used dimensionality reduction tools and performs particularly well for challenging scRNA-seq datasets with high sparsity and variability. To facilitate the use of scLENS, we provide a user-friendly package that automates accurate signal detection in scRNA-seq data without time-consuming manual tuning.
Affiliation(s)
- Hyun Kim
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
| | - Won Chang
- Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, 45221, USA
| | - Seok Joo Chae
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
- Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea
| | - Jong-Eun Park
- Graduate School of Medical Science and Engineering, KAIST, Daejeon, 34141, Republic of Korea
| | - Minseok Seo
- Department of Computer and Information Science, Korea University, Sejong, 30019, Republic of Korea
| | - Jae Kyoung Kim
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea.
- Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea.
| |
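The preprocessing idea named in the scLENS abstract, log normalization followed by L2 normalization so that every cell vector has the same length, can be sketched as follows. This is a minimal illustration and not the scLENS package: the function names are hypothetical, the scale factor of 10,000 is a common convention assumed here, and the actual pipeline may include further steps the abstract does not describe.

```python
import math
from typing import List, Sequence


def log_normalize(cell_counts: Sequence[float], scale: float = 10_000.0) -> List[float]:
    """Library-size scaling followed by log(1 + x), applied to one cell."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]  # empty cell: nothing to scale
    return [math.log1p(scale * c / total) for c in cell_counts]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Rescale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        return list(vec)
    return [v / norm for v in vec]


def normalize_cells(matrix: Sequence[Sequence[float]], scale: float = 10_000.0) -> List[List[float]]:
    """Apply log normalization, then L2 normalization, to each cell (row)."""
    return [l2_normalize(log_normalize(row, scale)) for row in matrix]


# Toy cells-by-genes matrix with very different sequencing depths (made-up numbers).
cells = [[1, 0, 3], [100, 50, 0], [0, 0, 0]]
normalized = normalize_cells(cells)
lengths = [math.sqrt(sum(v * v for v in row)) for row in normalized]
print(lengths)
```

After log normalization alone, cells with different sparsity end up with vectors of different lengths; the second step removes that difference, which is the distortion the abstract attributes to log normalization.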
|
21
|
Thulasiram MR, Yamamoto R, Olszewski RT, Gu S, Morell RJ, Hoa M, Dabdoub A. Molecular differences between neonatal and adult stria vascularis from organotypic explants and transcriptomics. bioRxiv 2024:2024.04.24.590986. [PMID: 38712156 PMCID: PMC11071502 DOI: 10.1101/2024.04.24.590986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Summary: The stria vascularis (SV), part of the blood-labyrinth barrier, is an essential component of the inner ear that regulates the ionic environment required for hearing. SV degeneration disrupts cochlear homeostasis, leading to irreversible hearing loss, yet a comprehensive understanding of the SV is lacking, and consequently so are therapies for SV degeneration. We developed a whole-tissue explant model from neonatal and adult mice to create a robust platform for SV research. We validated our model by demonstrating that the proliferative behaviour of the SV in vitro mimics that of the SV in vivo, providing a representative model and advancing high-throughput SV research. We also provided evidence for pharmacological intervention in our system by investigating the role of Wnt/β-catenin signaling in SV proliferation. Finally, we performed single-cell RNA sequencing of in vivo neonatal and adult mouse SV and revealed key genes and pathways that may play a role in SV proliferation and maintenance. Together, our results contribute new insights into investigating biological solutions for SV-associated hearing loss.
Significance: Hearing loss impairs our ability to communicate with people and interact with our environment. This can lead to social isolation, depression, cognitive deficits, and dementia. Inner ear degeneration is a primary cause of hearing loss, and our study provides an in-depth look at one of the major sites of inner ear degeneration: the stria vascularis. The stria vascularis and associated blood-labyrinth barrier maintain the functional integrity of the auditory system, yet they are relatively understudied. By developing a new in vitro model for the young and adult stria vascularis and using single-cell RNA sequencing, our study provides a novel approach to studying this tissue, contributing new insights and widespread implications for auditory neuroscience and regenerative medicine.
Highlights
- We established an organotypic explant system of the neonatal and adult stria vascularis with an intact blood-labyrinth barrier.
- Proliferation of the stria vascularis decreases with age in vitro, modelling its proliferative behaviour in vivo.
- Pharmacological studies using our in vitro SV model open possibilities for testing injury paradigms and therapeutic interventions.
- Inhibition of Wnt signalling decreases proliferation in neonatal stria vascularis.
- We identified key genes and transcription factors unique to developing and mature SV cell types using single-cell RNA sequencing.
|
22
|
Tian J, Bai X, Quek C. Single-Cell Informatics for Tumor Microenvironment and Immunotherapy. Int J Mol Sci 2024; 25:4485. [PMID: 38674070 PMCID: PMC11050520 DOI: 10.3390/ijms25084485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2024] [Revised: 04/12/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
Cancer comprises malignant cells surrounded by the tumor microenvironment (TME), a dynamic ecosystem composed of heterogeneous cell populations that exert unique influences on tumor development. The immune community within the TME plays a substantial role in tumorigenesis and tumor evolution. The innate and adaptive immune cells "talk" to the tumor through ligand-receptor interactions and signaling molecules, forming a complex communication network to influence the cellular and molecular basis of cancer. Such intricate intratumoral immune composition and interactions foster the application of immunotherapies, which empower the immune system against cancer to elicit durable long-term responses in cancer patients. Single-cell technologies have allowed for the dissection and characterization of the TME to an unprecedented level, while recent advancements in bioinformatics tools have expanded the horizon and depth of high-dimensional single-cell data analysis. This review will unravel the intertwined networks between malignancy and immunity, explore the utilization of computational tools for a deeper understanding of tumor-immune communications, and discuss the application of these approaches to aid diagnosis and treatment decision making in the clinical setting, as well as the current challenges researchers face and potential future improvements.
Affiliation(s)
| | | | - Camelia Quek
- Faculty of Medicine and Health, The University of Sydney, Sydney, NSW 2006, Australia; (J.T.); (X.B.)
| |
|
23
|
Barry T, Mason K, Roeder K, Katsevich E. Robust differential expression testing for single-cell CRISPR screens at low multiplicity of infection. bioRxiv 2024:2023.05.15.540875. [PMID: 38659821 PMCID: PMC11042176 DOI: 10.1101/2023.05.15.540875] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 04/26/2024]
Abstract
Single-cell CRISPR screens (perturb-seq) link genetic perturbations to phenotypic changes in individual cells. The most fundamental task in perturb-seq analysis is to test for association between a perturbation and a count outcome, such as gene expression. We conduct the first-ever comprehensive benchmarking study of association testing methods for low multiplicity-of-infection (MOI) perturb-seq data, finding that existing methods produce excess false positives. We conduct an extensive empirical investigation of the data, identifying three core analysis challenges: sparsity, confounding, and model misspecification. Finally, we develop an association testing method - SCEPTRE low-MOI - that resolves these analysis challenges and demonstrates improved calibration and power.
|
24
|
Baharav TZ, Tse D, Salzman J. OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery. Proc Natl Acad Sci U S A 2024; 121:e2304671121. [PMID: 38564640 PMCID: PMC11009617 DOI: 10.1073/pnas.2304671121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 02/08/2024] [Indexed: 04/04/2024] Open
Abstract
Contingency tables, data represented as counts matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's X² consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's X² in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
Affiliation(s)
- Tavor Z. Baharav
- Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115
| | - David Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305
| | - Julia Salzman
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305
- Department of Biochemistry, Stanford University, Stanford, CA 94305
- Department of Statistics (by courtesy), Stanford University, Stanford, CA 94305
| |
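As a point of reference for the baseline named in the OASIS abstract, the classical Pearson X² statistic for a contingency table can be computed as in the sketch below. This is the standard textbook statistic, not the OASIS test, whose construction the abstract only summarizes; the function names and the toy table are hypothetical.

```python
from typing import List, Sequence


def pearson_chi2(table: Sequence[Sequence[float]]) -> float:
    """Pearson's X² statistic: sum over cells of (observed - expected)² / expected,
    with expected counts taken under row/column independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("table has no observations")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in all-zero rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


def degrees_of_freedom(table: Sequence[Sequence[float]]) -> int:
    """Degrees of freedom of the asymptotic chi-squared reference distribution."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Toy 2x2 table (made-up counts).
table: List[List[int]] = [[10, 20], [20, 10]]
stat = pearson_chi2(table)
dof = degrees_of_freedom(table)
print(stat, dof)
```

The P-value for this statistic comes from an asymptotic chi-squared approximation, which is the finite-sample validity gap the abstract says OASIS is designed to close.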
|
25
|
Huuki-Myers LA, Montgomery KD, Kwon SH, Cinquemani S, Eagles NJ, Gonzalez-Padilla D, Maden SK, Kleinman JE, Hyde TM, Hicks SC, Maynard KR, Collado-Torres L. Benchmark of cellular deconvolution methods using a multi-assay reference dataset from postmortem human prefrontal cortex. bioRxiv 2024:2024.02.09.579665. [PMID: 38405805 PMCID: PMC10888823 DOI: 10.1101/2024.02.09.579665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Background Cellular deconvolution of bulk RNA-sequencing (RNA-seq) data using single cell or nuclei RNA-seq (sc/snRNA-seq) reference data is an important strategy for estimating cell type composition in heterogeneous tissues, such as human brain. Computational methods for deconvolution have been developed and benchmarked against simulated data, pseudobulked sc/snRNA-seq data, or immunohistochemistry reference data. A major limitation in developing improved deconvolution algorithms has been the lack of integrated datasets with orthogonal measurements of gene expression and estimates of cell type proportions on the same tissue sample. Deconvolution algorithm performance has not yet been evaluated across different RNA extraction methods (cytosolic, nuclear, or whole cell RNA), different library preparation types (mRNA enrichment vs. ribosomal RNA depletion), or with matched single cell reference datasets. Results A rich multi-assay dataset was generated in postmortem human dorsolateral prefrontal cortex (DLPFC) from 22 tissue blocks. Assays included spatially-resolved transcriptomics, snRNA-seq, bulk RNA-seq (across six library/extraction RNA-seq combinations), and RNAScope/Immunofluorescence (RNAScope/IF) for six broad cell types. The Mean Ratio method, implemented in the DeconvoBuddies R package, was developed for selecting cell type marker genes. Six computational deconvolution algorithms were evaluated in DLPFC, and predicted cell type proportions were compared to orthogonal RNAScope/IF measurements. Conclusions Bisque and hspe were the most accurate methods and were robust to differences in RNA library types and extractions. This multi-assay dataset showed that cell size differences, marker genes differentially quantified across RNA libraries, and cell composition variability in reference snRNA-seq impact the accuracy of current deconvolution methods.
Affiliation(s)
- Louise A. Huuki-Myers
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Kelsey D. Montgomery
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Sang Ho Kwon
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Sophia Cinquemani
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Nicholas J. Eagles
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Sean K. Maden
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Joel E. Kleinman
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Thomas M. Hyde
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Stephanie C. Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21205, USA
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, 21218, USA
- Kristen R. Maynard
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Leonardo Collado-Torres
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
|
26
|
Marmarelis MG, Littman R, Battaglin F, Niedzwiecki D, Venook A, Ambite JL, Galstyan A, Lenz HJ, Ver Steeg G. q-Diffusion leverages the full dimensionality of gene coexpression in single-cell transcriptomics. Commun Biol 2024; 7:400. [PMID: 38565955 DOI: 10.1038/s42003-024-06104-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 03/25/2024] [Indexed: 04/04/2024] Open
Abstract
Unlocking the full dimensionality of single-cell RNA sequencing data (scRNAseq) is the next frontier to a richer, fuller understanding of cell biology. We introduce q-diffusion, a framework for capturing the coexpression structure of an entire library of genes, improving on state-of-the-art analysis tools. The method is demonstrated via three case studies. In the first, q-diffusion helps gain statistical significance for differential effects on patient outcomes when analyzing the CALGB/SWOG 80405 randomized phase III clinical trial, suggesting precision guidance for the treatment of metastatic colorectal cancer. Secondly, q-diffusion is benchmarked against existing scRNAseq classification methods using an in vitro PBMC dataset, in which the proposed method discriminates IFN-γ stimulation more accurately. The same case study demonstrates improvements in unsupervised cell clustering with the recent Tabula Sapiens human atlas. Finally, a local distributional segmentation approach for spatial scRNAseq, driven by q-diffusion, yields interpretable structures of human cortical tissue.
Affiliation(s)
- Myrl G Marmarelis
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
- Russell Littman
  - University of California Los Angeles, Los Angeles, CA, 90095, USA
- Francesca Battaglin
  - Keck School of Medicine, University of Southern California, 1975 Zonal Ave., Los Angeles, CA, 90033, USA
- Alan Venook
  - University of California San Francisco, San Francisco, CA, 94143, USA
- Jose-Luis Ambite
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
- Aram Galstyan
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
- Heinz-Josef Lenz
  - Keck School of Medicine, University of Southern California, 1975 Zonal Ave., Los Angeles, CA, 90033, USA
- Greg Ver Steeg
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
  - University of California Riverside, Riverside, CA, 92521, USA
|
27
|
Zhu C, Liu LY, Yamaguchi TN, Zhu H, Hugh-White R, Livingstone J, Patel Y, Kislinger T, Boutros PC. moPepGen: Rapid and Comprehensive Proteoform Identification. bioRxiv 2024:2024.03.28.587261. [PMID: 38585946 PMCID: PMC10996593 DOI: 10.1101/2024.03.28.587261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Gene expression is a multi-step transformation of biological information from its storage form (DNA) into functional forms (protein and some RNAs). Regulatory activities at each step of this transformation multiply a single gene into a myriad of proteoforms. Proteogenomics is the study of how genomic and transcriptomic variation creates this proteoform diversity, and is limited by the challenges of modeling the complexities of gene-expression. We therefore created moPepGen, a graph-based algorithm that comprehensively enumerates proteoforms in linear time. moPepGen works with multiple technologies, in multiple species and on all types of genetic and transcriptomic data. In human cancer proteomes, it detects and quantifies previously unobserved noncanonical peptides arising from germline and somatic genomic variants, noncoding open reading frames, RNA fusions and RNA circularization. By enabling efficient identification and quantitation of previously hidden proteins in both existing and new proteomic data, moPepGen facilitates all proteogenomics applications. It is available at: https://github.com/uclahs-cds/package-moPepGen.
Affiliation(s)
- Chenghao Zhu
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, CA, USA
  - Department of Urology, University of California, Los Angeles, CA, USA
- Lydia Y. Liu
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Canada
- Takafumi N. Yamaguchi
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, CA, USA
- Helen Zhu
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Canada
- Rupert Hugh-White
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, CA, USA
- Julie Livingstone
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, CA, USA
- Yash Patel
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, CA, USA
- Thomas Kislinger
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
- Paul C. Boutros
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, CA, USA
  - Department of Urology, University of California, Los Angeles, CA, USA
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
|
28
|
Keskus A, Bryant A, Ahmad T, Yoo B, Aganezov S, Goretsky A, Donmez A, Lansdon LA, Rodriguez I, Park J, Liu Y, Cui X, Gardner J, McNulty B, Sacco S, Shetty J, Zhao Y, Tran B, Narzisi G, Helland A, Cook DE, Chang PC, Kolesnikov A, Carroll A, Molloy EK, Pushel I, Guest E, Pastinen T, Shafin K, Miga KH, Malikic S, Day CP, Robine N, Sahinalp C, Dean M, Farooqi MS, Paten B, Kolmogorov M. Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads. medRxiv 2024:2024.03.22.24304756. [PMID: 38585974 PMCID: PMC10996739 DOI: 10.1101/2024.03.22.24304756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Most current studies rely on short-read sequencing to detect somatic structural variation (SV) in cancer genomes. Long-read sequencing offers the advantage of better mappability and long-range phasing, which results in substantial improvements in germline SV detection. However, current long-read SV detection methods do not generalize well to the analysis of somatic SVs in tumor genomes with complex rearrangements, heterogeneity, and aneuploidy. Here, we present Severus: a method for the accurate detection of different types of somatic SVs using a phased breakpoint graph approach. To benchmark various short- and long-read SV detection methods, we sequenced five tumor/normal cell line pairs with Illumina, Nanopore, and PacBio sequencing platforms; on this benchmark Severus showed the highest F1 scores (harmonic mean of the precision and recall) as compared to long-read and short-read methods. We then applied Severus to three clinical cases of pediatric cancer, demonstrating concordance with known genetic findings as well as revealing clinically relevant cryptic rearrangements missed by standard genomic panels.
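The F1 score mentioned in the abstract, the harmonic mean of precision and recall, can be written out directly. The sketch below is a generic illustration and not the paper's evaluation code; the function name `f1_score` and the call counts are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for one call set against a truth set."""
    called = true_positives + false_positives   # everything the caller reported
    truth = true_positives + false_negatives    # everything in the truth set
    if true_positives == 0 or called == 0 or truth == 0:
        return 0.0
    precision = true_positives / called
    recall = true_positives / truth
    return 2 * precision * recall / (precision + recall)


# Hypothetical structural-variant call counts (illustrative values only).
score = f1_score(true_positives=90, false_positives=10, false_negatives=30)
print(round(score, 4))  # 0.8182
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a caller cannot reach a high F1 by maximizing precision at the expense of recall, or the reverse.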
Affiliation(s)
- Ayse Keskus
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Asher Bryant
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Tanveer Ahmad
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Byunggil Yoo
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Anton Goretsky
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Ataberk Donmez
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Lisa A. Lansdon
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Isabel Rodriguez
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, USA
- Jimin Park
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Yuelin Liu
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Xiwen Cui
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Samuel Sacco
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Jyoti Shetty
  - Sequencing Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Yongmei Zhao
  - Sequencing Facility Bioinformatics Group, Biomedical Informatics and Data Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Bao Tran
  - Sequencing Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Erin K. Molloy
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Irina Pushel
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Erin Guest
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Tomi Pastinen
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Kishwar Shafin
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, USA
- Karen H. Miga
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Salem Malikic
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Chi-Ping Day
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Cenk Sahinalp
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Michael Dean
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, USA
- Midhat S. Farooqi
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Mikhail Kolmogorov
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
|
29
|
Jones EF, Haldar A, Oza VH, Lasseigne BN. Quantifying transcriptome diversity: a review. Brief Funct Genomics 2024; 23:83-94. [PMID: 37225889 DOI: 10.1093/bfgp/elad019] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/18/2023] [Revised: 04/14/2023] [Accepted: 05/05/2023] [Indexed: 05/26/2023] Open
Abstract
Following the central dogma of molecular biology, gene expression heterogeneity can aid in predicting and explaining the wide variety of protein products, functions and, ultimately, heterogeneity in phenotypes. There is currently overlapping terminology used to describe the types of diversity in gene expression profiles, and overlooking these nuances can misrepresent important biological information. Here, we describe transcriptome diversity as a measure of the heterogeneity in (1) the expression of all genes within a sample or a single gene across samples in a population (gene-level diversity) or (2) the isoform-specific expression of a given gene (isoform-level diversity). We first overview modulators and quantification of transcriptome diversity at the gene level. Then, we discuss the role alternative splicing plays in driving transcript isoform-level diversity and how it can be quantified. Additionally, we overview computational resources for calculating gene-level and isoform-level diversity for high-throughput sequencing data. Finally, we discuss future applications of transcriptome diversity. This review provides a comprehensive overview of how gene expression diversity arises, and how measuring it determines a more complete picture of heterogeneity across proteins, cells, tissues, organisms and species.
Affiliation(s)
- Emma F Jones
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Anisha Haldar
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Vishal H Oza
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Brittany N Lasseigne
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
|
30
|
Weideman AMK, Wang R, Ibrahim JG, Jiang Y. Canopy2: tumor phylogeny inference by bulk DNA and single-cell RNA sequencing. bioRxiv 2024:2024.03.18.585595. [PMID: 38562795 PMCID: PMC10983938 DOI: 10.1101/2024.03.18.585595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Tumors are comprised of a mixture of distinct cell populations that differ in terms of genetic makeup and function. Such heterogeneity plays a role in the development of drug resistance and the ineffectiveness of targeted cancer therapies. Insight into this complexity can be obtained through the construction of a phylogenetic tree, which illustrates the evolutionary lineage of tumor cells as they acquire mutations over time. We propose Canopy2, a Bayesian framework that uses single nucleotide variants derived from bulk DNA and single-cell RNA sequencing to infer tumor phylogeny and conduct mutational profiling of tumor subpopulations. Canopy2 uses Markov chain Monte Carlo methods to sample from a joint probability distribution involving a mixture of binomial and beta-binomial distributions, specifically chosen to account for the sparsity and stochasticity of the single-cell data. Canopy2 demystifies the sources of zeros in the single-cell data and separates zeros categorized as non-cancerous (cells without mutations), stochastic (mutations not expressed due to bursting), and technical (expressed mutations not picked up by sequencing). Simulations demonstrate that Canopy2 consistently outperforms competing methods and reconstructs the clonal tree with high fidelity, even in situations involving low sequencing depth, poor single-cell yield, and highly-advanced and polyclonal tumors. We further assess the performance of Canopy2 through application to breast cancer and glioblastoma data, benchmarking against existing methods. Canopy2 is an open-source R package available at https://github.com/annweideman/canopy2.
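To illustrate why a beta-binomial component helps with sparse single-cell read counts, the sketch below compares the probability of observing zero mutant reads under a binomial and under a beta-binomial with the same mean. It is not Canopy2's likelihood: the function names, the (alpha, beta) parameterization, and the numeric values are assumptions chosen for illustration.

```python
from math import exp, lgamma, log


def log_choose(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binomial_logpmf(k, n, p):
    """log P(k mutant reads out of n) with a fixed success probability 0 < p < 1."""
    return log_choose(n, k) + k * log(p) + (n - k) * log(1 - p)


def beta_binomial_logpmf(k, n, alpha, beta):
    """log P(k mutant reads out of n) when p itself follows Beta(alpha, beta)."""
    return log_choose(n, k) + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# Hypothetical setting: 20 reads, mean mutant-read fraction 0.1 in both models.
n_reads = 20
p_zero_binomial = exp(binomial_logpmf(0, n_reads, 0.1))
p_zero_beta_binomial = exp(beta_binomial_logpmf(0, n_reads, alpha=0.2, beta=1.8))
print(p_zero_beta_binomial > p_zero_binomial)  # True
```

With the mean held fixed, the overdispersed beta-binomial assigns more probability to zero counts, which is the kind of excess-zero behavior the abstract attributes to bursting and sequencing dropout.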
Affiliation(s)
- Ann Marie K. Weideman
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Rujin Wang
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Joseph G. Ibrahim
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yuchao Jiang
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
|
31
|
Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. bioRxiv 2024:2024.02.10.579721. [PMID: 38496530 PMCID: PMC10942266 DOI: 10.1101/2024.02.10.579721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, including both the genetic relatedness matrix (GRM) and its leading eigenvectors (which correspond to the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses. As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
|
32
|
Lin KZ, Qiu Y, Roeder K. eSVD-DE: Cohort-wide differential expression in single-cell RNA-seq data using exponential-family embeddings. bioRxiv 2024:2023.11.22.568369. [PMID: 38045428 PMCID: PMC10690270 DOI: 10.1101/2023.11.22.568369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2023]
Abstract
Background Single-cell RNA-sequencing (scRNA) datasets are becoming increasingly popular in clinical and cohort studies, but there is a lack of methods to investigate differentially expressed (DE) genes among such datasets with numerous individuals. While numerous methods exist to find DE genes for scRNA data from limited individuals, differential-expression testing for large cohorts of case and control individuals using scRNA data poses unique challenges due to substantial effects of human variation, i.e., individual-level confounding covariates that are difficult to account for in the presence of sparsely-observed genes. Results We develop eSVD-DE, a matrix factorization that pools information across genes and removes confounding covariate effects, followed by a novel two-sample test in mean expression between case and control individuals. In general, differential testing after dimension reduction yields an inflation of Type-1 errors. However, we overcome this by testing for differences between the case and control individuals' posterior mean distributions via a hierarchical model. In previously published datasets of various biological systems, eSVD-DE is more accurate and more powerful than other DE methods typically repurposed for analyzing cohort-wide differential expression. Conclusions eSVD-DE proposes a novel and powerful way to test for DE genes among cohorts after performing a dimension reduction. Accurate identification of differential expression on the individual level, instead of the cell level, is important for linking scRNA-seq studies to our understanding of the human population.
Affiliation(s)
- Kevin Z Lin
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Yixuan Qiu
  - School of Statistics & Management, Shanghai University of Finance and Economics, Shanghai, People's Republic of China
- Kathryn Roeder
  - Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
|
33
|
Öling S, Struck E, Noreen-Thorsen M, Zwahlen M, von Feilitzen K, Odeberg J, Pontén F, Lindskog C, Uhlén M, Dusart P, Butler LM. A human stomach cell type transcriptome atlas. BMC Biol 2024; 22:36. [PMID: 38355543 PMCID: PMC10865703 DOI: 10.1186/s12915-024-01812-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 01/02/2024] [Indexed: 02/16/2024] Open
Abstract
BACKGROUND The identification of cell type-specific genes and their modification under different conditions is central to our understanding of human health and disease. The stomach, a hollow organ in the upper gastrointestinal tract, provides an acidic environment that contributes to microbial defence and facilitates the activity of secreted digestive enzymes to process food and nutrients into chyme. In contrast to other sections of the gastrointestinal tract, detailed descriptions of cell type gene enrichment profiles in the stomach are absent from the major single-cell sequencing-based atlases. RESULTS Here, we use an integrative correlation analysis method to predict human stomach cell type transcriptome signatures using unfractionated stomach RNAseq data from 359 individuals. We profile parietal, chief, gastric mucous, gastric enteroendocrine, mitotic, endothelial, fibroblast, macrophage, neutrophil, T-cell, and plasma cells, identifying over 1600 cell type-enriched genes. CONCLUSIONS We uncover the cell type expression profile of several non-coding genes strongly associated with the progression of gastric cancer and, using a sex-based subset analysis, uncover a panel of male-only chief cell-enriched genes. This study provides a roadmap to further understand human stomach biology.
Affiliation(s)
- S Öling
  - Department of Clinical Medicine, Translational Vascular Research, The Arctic University of Norway, 9019, Tromsø, Norway
- E Struck
  - Department of Clinical Medicine, Translational Vascular Research, The Arctic University of Norway, 9019, Tromsø, Norway
- M Noreen-Thorsen
  - Department of Clinical Medicine, Translational Vascular Research, The Arctic University of Norway, 9019, Tromsø, Norway
- M Zwahlen
  - Science for Life Laboratory, Department of Protein Science, Royal Institute of Technology (KTH), 171 21, Stockholm, Sweden
- K von Feilitzen
  - Science for Life Laboratory, Department of Protein Science, Royal Institute of Technology (KTH), 171 21, Stockholm, Sweden
- J Odeberg
  - Department of Clinical Medicine, Translational Vascular Research, The Arctic University of Norway, 9019, Tromsø, Norway
  - Science for Life Laboratory, Department of Protein Science, Royal Institute of Technology (KTH), 171 21, Stockholm, Sweden
  - The University Hospital of North Norway (UNN), 9019, Tromsø, Norway
  - Department of Haematology, Coagulation Unit, Karolinska University Hospital, 171 76, Stockholm, Sweden
- F Pontén
  - Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, 752 37, Uppsala, Sweden
- C Lindskog
  - Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, 752 37, Uppsala, Sweden
- M Uhlén
  - Science for Life Laboratory, Department of Protein Science, Royal Institute of Technology (KTH), 171 21, Stockholm, Sweden
- P Dusart
  - Science for Life Laboratory, Department of Protein Science, Royal Institute of Technology (KTH), 171 21, Stockholm, Sweden
  - Clinical Chemistry and Blood Coagulation Research, Department of Molecular Medicine and Surgery, Karolinska Institute, 171 76, Stockholm, Sweden
  - Clinical Chemistry, Karolinska University Laboratory, Karolinska University Hospital, 171 76, Stockholm, Sweden
- L M Butler
  - Department of Clinical Medicine, Translational Vascular Research, The Arctic University of Norway, 9019, Tromsø, Norway
  - Science for Life Laboratory, Department of Protein Science, Royal Institute of Technology (KTH), 171 21, Stockholm, Sweden
  - Clinical Chemistry and Blood Coagulation Research, Department of Molecular Medicine and Surgery, Karolinska Institute, 171 76, Stockholm, Sweden
  - Clinical Chemistry, Karolinska University Laboratory, Karolinska University Hospital, 171 76, Stockholm, Sweden
|
34
|
Groh JS, Vik DC, Stevens KA, Brown PJ, Langley CH, Coop G. Distinct ancient structural polymorphisms control heterodichogamy in walnuts and hickories. bioRxiv 2024:2023.12.23.573205. [PMID: 38187547 PMCID: PMC10769452 DOI: 10.1101/2023.12.23.573205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
The maintenance of stable mating type polymorphisms is a classic example of balancing selection, underlying the nearly ubiquitous 50/50 sex ratio in species with separate sexes. One lesser known but intriguing example of a balanced mating polymorphism in angiosperms is heterodichogamy - polymorphism for opposing directions of dichogamy (temporal separation of male and female function in hermaphrodites) within a flowering season. This mating system is common throughout Juglandaceae, the family that includes globally important and iconic nut and timber crops - walnuts (Juglans), as well as pecan and other hickories (Carya). In both genera, heterodichogamy is controlled by a single dominant allele. We fine-map the locus in each genus, and find two ancient (>50 Mya) structural variants involving different genes that both segregate as genus-wide trans-species polymorphisms. The Juglans locus maps to a ca. 20 kb structural variant adjacent to a probable trehalose phosphate phosphatase (TPPD-1), homologs of which regulate floral development in model systems. TPPD-1 is differentially expressed between morphs in developing male flowers, with increased allele-specific expression of the dominant haplotype copy. Across species, the dominant haplotype contains a tandem array of duplicated sequence motifs, part of which is an inverted copy of the TPPD-1 3' UTR. These repeats generate various distinct small RNAs matching sequences within the 3' UTR and further downstream. In contrast to the single-gene Juglans locus, the Carya heterodichogamy locus maps to a ca. 200-450 kb cluster of tightly linked polymorphisms across 20 genes, some of which have known roles in flowering and are differentially expressed between morphs in developing flowers. 
The dominant haplotype in pecan, which is nearly always heterozygous and appears to rarely recombine, shows markedly reduced genetic diversity and is over twice as long as its recessive counterpart due to accumulation of various types of transposable elements. We did not detect either genetic system in other heterodichogamous genera within Juglandaceae, suggesting that additional genetic systems for heterodichogamy may yet remain undiscovered.
Affiliation(s)
- Jeffrey S Groh
  - Department of Evolution and Ecology, University of California, Davis
  - Center for Population Biology, University of California, Davis
- Diane C Vik
  - Department of Evolution and Ecology, University of California, Davis
- Kristian A Stevens
  - Department of Evolution and Ecology, University of California, Davis
  - Department of Computer Science, University of California, Davis
- Patrick J Brown
  - Department of Plant Sciences, University of California, Davis
- Charles H Langley
  - Department of Evolution and Ecology, University of California, Davis
  - Center for Population Biology, University of California, Davis
- Graham Coop
  - Department of Evolution and Ecology, University of California, Davis
  - Center for Population Biology, University of California, Davis
|
35
|
Church SH, Mah JL, Wagner G, Dunn CW. Normalizing need not be the norm: count-based math for analyzing single-cell data. Theory Biosci 2024; 143:45-62. [PMID: 37947999 DOI: 10.1007/s12064-023-00408-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Accepted: 10/13/2023] [Indexed: 11/12/2023]
Abstract
Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.
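As an illustration of the count-based comparison described in this abstract, the following minimal Python sketch computes dot products between raw count vectors without any depth normalization. It does not use the countland package or reproduce its API; the function name `count_dot` and the toy counts are hypothetical and chosen only to show the idea.

```python
def count_dot(cell_a, cell_b):
    """Dot product of two raw integer count vectors over the same gene order."""
    if len(cell_a) != len(cell_b):
        raise ValueError("cells must be measured over the same genes")
    return sum(a * b for a, b in zip(cell_a, cell_b))


# Hypothetical raw counts for three cells over four genes (illustration only).
cells = {
    "cell_1": [3, 0, 2, 0],
    "cell_2": [1, 0, 4, 0],
    "cell_3": [0, 5, 0, 1],
}

# Pairwise dot products; the result stays an integer because no
# normalization or log transformation is applied to the counts.
pairwise = {
    (i, j): count_dot(cells[i], cells[j])
    for i in cells
    for j in cells
    if i < j
}
```

In this toy case, cells that have counts in the same genes obtain a positive product, while cells with no co-detected genes obtain zero.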
Affiliation(s)
- Samuel H Church
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA.
- Jasmine L Mah
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Günter Wagner
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
  - Yale Systems Biology Institute, Yale University, New Haven, CT, USA
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Yale Medical School, New Haven, CT, USA
  - Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
- Casey W Dunn
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
|
36
|
Kousnetsov R, Bourque J, Surnov A, Fallahee I, Hawiger D. Single-cell sequencing analysis within biologically relevant dimensions. Cell Syst 2024; 15:83-103.e11. [PMID: 38198894 DOI: 10.1016/j.cels.2023.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 05/23/2023] [Accepted: 12/14/2023] [Indexed: 01/12/2024]
Abstract
The currently predominant approach to transcriptomic and epigenomic single-cell analysis depends on a rigid perspective constrained by reduced dimensions and algorithmically derived and annotated clusters. Here, we developed Seqtometry (sequencing-to-measurement), a single-cell analytical strategy based on biologically relevant dimensions enabled by advanced scoring with multiple gene sets (signatures) for examination of gene expression and accessibility across various organ systems. By utilizing information only in the form of specific signatures, Seqtometry bypasses unsupervised clustering and individual annotations of clusters. Instead, Seqtometry combines qualitative and quantitative cell-type identification with specific characterization of diverse biological processes under experimental or disease conditions. Comprehensive analysis by Seqtometry of various immune cells as well as other cells from different organs and disease-induced states, including multiple myeloma and Alzheimer's disease, surpasses corresponding cluster-based analytical output. We propose Seqtometry as a single-cell sequencing analysis approach applicable for both basic and clinical research.
Affiliation(s)
- Robert Kousnetsov
  - Department of Molecular Microbiology and Immunology, Saint Louis University School of Medicine, St. Louis, MO, USA
- Jessica Bourque
  - Department of Molecular Microbiology and Immunology, Saint Louis University School of Medicine, St. Louis, MO, USA
- Alexey Surnov
  - Department of Molecular Microbiology and Immunology, Saint Louis University School of Medicine, St. Louis, MO, USA
- Ian Fallahee
  - Department of Molecular Microbiology and Immunology, Saint Louis University School of Medicine, St. Louis, MO, USA
- Daniel Hawiger
  - Department of Molecular Microbiology and Immunology, Saint Louis University School of Medicine, St. Louis, MO, USA.
|
37
|
Maiorino E, De Marzio M, Xu Z, Yun JH, Chase RP, Hersh CP, Weiss ST, Silverman EK, Castaldi PJ, Glass K. Joint clinical and molecular subtyping of COPD with variational autoencoders. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2023.08.19.23294298. [PMID: 38260473 PMCID: PMC10802661 DOI: 10.1101/2023.08.19.23294298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Chronic Obstructive Pulmonary Disease (COPD) is a complex, heterogeneous disease. Traditional subtyping methods generally focus on either the clinical manifestations or the molecular endotypes of the disease, resulting in classifications that do not fully capture the disease's complexity. Here, we bridge this gap by introducing a subtyping pipeline that integrates clinical and gene expression data with variational autoencoders. We apply this methodology to the COPDGene study, a large study of current and former smoking individuals with and without COPD. Our approach generates a set of vector embeddings, called Personalized Integrated Profiles (PIPs), that recapitulate the joint clinical and molecular state of the subjects in the study. Prediction experiments show that the PIPs have a predictive accuracy comparable to or better than other embedding approaches. Using trajectory learning approaches, we analyze the main trajectories of variation in the PIP space and identify five well-separated subtypes with distinct clinical phenotypes, expression signatures, and disease outcomes. Notably, these subtypes are more robust to data resampling compared to those identified using traditional clustering approaches. Overall, our findings provide new avenues to establish fine-grained associations between the clinical characteristics, molecular processes, and disease outcomes of COPD.
Affiliation(s)
- Enrico Maiorino
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Margherita De Marzio
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Zhonghui Xu
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Jeong H. Yun
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Robert P. Chase
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Craig P. Hersh
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Scott T. Weiss
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Edwin K. Silverman
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
|
38
|
Chrysinas P, Venkatesan S, Ang I, Ghosh V, Chen C, Neelamegham S, Gunawan R. Cell and tissue-specific glycosylation pathways informed by single-cell transcriptomics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.09.26.559616. [PMID: 38260527 PMCID: PMC10802235 DOI: 10.1101/2023.09.26.559616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
While single cell studies have made significant impacts in various subfields of biology, they lag in the Glycosciences. To address this gap, we analyzed single-cell glycogene expressions in the Tabula Sapiens dataset of human tissues and cell types using a recent glycosylation-specific gene ontology (GlycoEnzOnto). At the median sequencing (count) depth, ~40-50 out of 400 glycogenes were detected in individual cells. Upon increasing the sequencing depth, the number of detectable glycogenes saturates at ~200 glycogenes, suggesting that the average human cell expresses about half of the glycogene repertoire. Hierarchies in glycogene and glycopathway expressions emerged from our analysis: nucleotide-sugar synthesis and transport exhibited the highest gene expressions, followed by genes for core enzymes, glycan modification and extensions, and finally terminal modifications. Interestingly, the same cell types showed variable glycopathway expressions based on their organ or tissue origin, suggesting nuanced cell- and tissue-specific glycosylation patterns. Probing deeper into the transcription factors (TFs) of glycogenes, we identified distinct groupings of TFs controlling different aspects of glycosylation: core biosynthesis, terminal modifications, etc. We present webtools to explore the interconnections across glycogenes, glycopathways, and TFs regulating glycosylation in human cell/tissue types. Overall, the study presents an overview of glycosylation across multiple human organ systems.
Affiliation(s)
- Panagiotis Chrysinas
  - Department of Chemical and Biological Engineering, University at Buffalo-SUNY, Buffalo, NY, 14260, USA
- Shriramprasad Venkatesan
  - Department of Chemical and Biological Engineering, University at Buffalo-SUNY, Buffalo, NY, 14260, USA
- Isaac Ang
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Vishnu Ghosh
  - Department of Chemical and Biological Engineering, University at Buffalo-SUNY, Buffalo, NY, 14260, USA
- Changyou Chen
  - Department of Computer Science and Engineering, University at Buffalo-SUNY, Buffalo, NY, 14260, USA
- Sriram Neelamegham
  - Department of Chemical and Biological Engineering, University at Buffalo-SUNY, Buffalo, NY, 14260, USA
- Rudiyanto Gunawan
  - Department of Chemical and Biological Engineering, University at Buffalo-SUNY, Buffalo, NY, 14260, USA
|
39
|
Holcombe J, Weavers H. Functional-metabolic coupling in distinct renal cell types coordinates organ-wide physiology and delays premature ageing. Nat Commun 2023; 14:8405. [PMID: 38110414 PMCID: PMC10728150 DOI: 10.1038/s41467-023-44098-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 11/30/2023] [Indexed: 12/20/2023] Open
Abstract
Precise coupling between cellular physiology and metabolism is emerging as a vital relationship underpinning tissue health and longevity. Nevertheless, functional-metabolic coupling within heterogenous microenvironments in vivo remains poorly understood due to tissue complexity and metabolic plasticity. Here, we establish the Drosophila renal system as a paradigm for linking mechanistic analysis of metabolism, at single-cell resolution, to organ-wide physiology. Kidneys are amongst the most energetically-demanding organs, yet exactly how individual cell types fine-tune metabolism to meet their diverse, unique physiologies over the life-course remains unclear. Integrating live-imaging of metabolite and organelle dynamics with spatio-temporal genetic perturbation within intact functional tissue, we uncover distinct cellular metabolic signatures essential to support renal physiology and healthy ageing. Cell type-specific programming of glucose handling, PPP-mediated glutathione regeneration and FA β-oxidation via dynamic lipid-peroxisomal networks, downstream of differential ERR receptor activity, precisely match cellular energetic demands whilst limiting damage and premature senescence; however, their dramatic dysregulation may underlie age-related renal dysfunction.
Affiliation(s)
- Jack Holcombe
  - School of Biochemistry, Biomedical Sciences, University of Bristol, Bristol, BS8 1TD, UK
- Helen Weavers
  - School of Biochemistry, Biomedical Sciences, University of Bristol, Bristol, BS8 1TD, UK.
|
40
|
Karakurt HU, Pir P. SUMA: a lightweight machine learning model-powered shared nearest neighbour-based clustering application interface for scRNA-Seq data. Turk J Biol 2023; 47:413-422. [PMID: 38681777 PMCID: PMC11045205 DOI: 10.55730/1300-0152.2675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/28/2023] [Accepted: 12/18/2023] [Indexed: 05/01/2024] Open
Abstract
Background/aim: Single-cell transcriptomics (scRNA-Seq) explores cellular diversity at the gene expression level. Due to the inherent sparsity and noise in scRNA-Seq data and the uncertainty about the types of sequenced cells, effective clustering and cell type annotation are essential. The graph-based clustering of scRNA-Seq data is a simple yet powerful approach that presents data as a "shared nearest neighbour" graph and clusters the cells using graph clustering algorithms. These algorithms are dependent on several user-defined parameters. Here we present SUMA, a lightweight tool that uses a random forest model to predict the optimum number of neighbours to obtain the optimum clustering results. Moreover, we integrated our method with other commonly used methods in an RShiny application. SUMA can be used in a local environment (https://github.com/hkarakurt8742/SUMA) or as a browser tool (https://hkarakurt.shinyapps.io/suma/). Materials and methods: Publicly available scRNA-Seq datasets and 3 different graph-based clustering algorithms were used to develop SUMA, and a large range for the number of neighbours and variant genes was taken into consideration. The quality of clustering was assessed using the adjusted Rand index (ARI) and true labels of each dataset. The data were split into training and test datasets, and the model was built and optimised using Scikit-learn (Python) and randomForest (R) libraries. Results: The accuracy of our machine learning model was 0.96, while the AUC of the ROC curve was 0.98. The model indicated that the number of cells in scRNA-Seq data is the most important feature when deciding the number of neighbours. Conclusion: We developed and evaluated the SUMA model and implemented the method in the SUMAShiny app, which integrates SUMA with different clustering methods and enables nonbioinformatician users to cluster and visualise their scRNA-Seq data easily. The SUMAShiny app is available both for desktop and browser use.
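To make the role of the "number of neighbours" parameter concrete, the sketch below builds a small shared nearest neighbour graph in plain Python. It is not SUMA's code: the Jaccard weighting of neighbour sets is one common convention assumed here, and the function names and toy coordinates are hypothetical.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbours (self included)."""
    neighbours = []
    for p in points:
        # Sort all indices by Euclidean distance to p; ties are broken by index.
        order = sorted(range(len(points)), key=lambda j: (math.dist(p, points[j]), j))
        neighbours.append(set(order[:k]))
    return neighbours


def snn_weights(points, k):
    """Edge weights given by the Jaccard overlap of k-nearest-neighbour sets."""
    nn = knn_sets(points, k)
    weights = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(nn[i] & nn[j])
            if shared:  # keep an edge only when the two cells share a neighbour
                weights[(i, j)] = shared / len(nn[i] | nn[j])
    return weights


# Hypothetical 2-D embedding of six cells forming two tight groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
weights = snn_weights(points, k=3)
```

With `k=3` the two groups are disconnected in the graph; a larger `k` would start to link them, which is why the choice of `k` affects the clusters that a graph clustering algorithm recovers.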
Affiliation(s)
- Hamza Umut Karakurt
  - Department of Bioengineering, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkiye
  - Idea Technology Solutions R&D Center, İstanbul, Turkiye
- Pınar Pir
  - Department of Bioengineering, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkiye
  - Idea Technology Solutions R&D Center, İstanbul, Turkiye
|
41
|
Choi J, Ehrlich ME, Roussos P, Wang P, Yuan GC, Song X. QuadST: A Powerful and Robust Approach for Identifying Cell-Cell Interaction-Changed Genes on Spatially Resolved Transcriptomics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.04.570019. [PMID: 38106025 PMCID: PMC10723309 DOI: 10.1101/2023.12.04.570019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
Spatially resolved transcriptomics (SRT) has enabled profiling of the spatial organization of cells and their transcriptomes in situ. Various analytical methods have been developed to uncover cell-cell interaction processes using SRT data. To improve upon existing efforts, we developed a novel statistical framework called QuadST for the robust and powerful identification of interaction-changed genes (ICGs) for cell-type-pair specific interactions on a single-cell SRT dataset. QuadST is motivated by the idea that in the presence of cell-cell interaction, gene expression level can vary with cell-cell distance between cell type pairs, which can be particularly pronounced within and in the vicinity of cell-cell interaction distance. Specifically, QuadST infers ICGs in a specific cell type pair's interaction based on a quantile regression model, which allows us to assess the strength of distance-expression association across entire distance quantiles conditioned on gene expression level. To identify ICGs, QuadST performs hypothesis testing with an empirically estimated FDR, whose upper bound is determined by the ratio of cumulative associations at symmetrically smaller and larger distance quantiles simultaneously across all genes. Simulation studies illustrate that QuadST provides consistent FDR control and better power than the other methods compared. Its application to SRT datasets profiled from mouse brains demonstrates that QuadST can identify ICGs presumed to play a role in specific cell type pair interactions (e.g., synaptic pathway genes among excitatory neuron cell interactions). These results suggest that QuadST can be a useful tool to discover genes and regulatory processes involved in specific cell type pair interactions.
Affiliation(s)
- Jinmyung Choi
  - Institute for Health Care Delivery Science, Department of Population Health Science and Policy, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Michelle E. Ehrlich
  - Departments of Neurology, Pediatrics, and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Panos Roussos
  - Center for Disease Neurogenomics, Department of Psychiatry, Department of Genetics and Genomics Sciences, Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, NY, USA; Mental Illness Research Education, and Clinical Center (VISN 2 South), James J. Peters VA Medical Center, Bronx, NY, USA
- Pei Wang
  - Department of Genetics and Genomic Sciences, Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Guo-Cheng Yuan
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Xiaoyu Song
  - Institute for Health Care Delivery Science, Department of Population Health Science and Policy, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|
42
|
Wang Z, Xie X, Liu S, Ji Z. scFseCluster: a feature selection-enhanced clustering for single-cell RNA-seq data. Life Sci Alliance 2023; 6:e202302103. [PMID: 37788907 PMCID: PMC10547911 DOI: 10.26508/lsa.202302103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Revised: 09/21/2023] [Accepted: 09/22/2023] [Indexed: 10/05/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables researchers to reveal previously unknown cell heterogeneity and functional diversity, which is impossible with bulk RNA sequencing. Clustering approaches are widely used for analyzing scRNA-seq data and identifying cell types and states. In the past few years, various advanced computational strategies have emerged. However, low generalization and high computational cost are the main bottlenecks of existing methods. In this study, we established a novel computational framework, scFseCluster, for scRNA-seq clustering analysis. scFseCluster incorporates a metaheuristic algorithm (Feature Selection based on Quantum Squirrel Search Algorithm) to extract the optimal gene set, which largely guarantees the performance of cell clustering. We conducted simulation experiments in several aspects to verify the performance of the proposed approach. scFseCluster performed very well on eight benchmark scRNA-seq datasets because of the optimal gene sets obtained using the Feature Selection based on Quantum Squirrel Search Algorithm. The comparative study demonstrated the significant advantages of scFseCluster over seven state-of-the-art algorithms. In addition, our analysis shows that feature selection on highly variable genes can significantly improve clustering performance. In conclusion, our study demonstrates that scFseCluster is a highly versatile tool for enhancing scRNA-seq data clustering analysis.
Affiliation(s)
- Zongqin Wang
  - https://ror.org/05td3s095 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- Xiaojun Xie
  - https://ror.org/05td3s095 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
  - https://ror.org/05td3s095 Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
- Shouyang Liu
  - https://ror.org/05td3s095 Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
- Zhiwei Ji
  - https://ror.org/05td3s095 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
  - https://ror.org/05td3s095 Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
|
43
|
Zhang Y, Ben Nathan J, Moreno A, Merkel R, Kahng MW, Hayes MR, Reiner BC, Crist RC, Schmidt HD. Calcitonin receptor signaling in nucleus accumbens D1R- and D2R-expressing medium spiny neurons bidirectionally alters opioid taking in male rats. Neuropsychopharmacology 2023; 48:1878-1888. [PMID: 37355732 PMCID: PMC10584857 DOI: 10.1038/s41386-023-01634-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Revised: 06/12/2023] [Accepted: 06/13/2023] [Indexed: 06/26/2023]
Abstract
The high rates of relapse associated with current medications used to treat opioid use disorder (OUD) necessitate research that expands our understanding of the neural mechanisms regulating opioid taking to identify molecular substrates that could be targeted by novel pharmacotherapies to treat OUD. Recent studies show that activation of calcitonin receptors (CTRs) is sufficient to reduce the rewarding effects of addictive drugs in rodents. However, the role of central CTR signaling in opioid-mediated behaviors has not been studied. Here, we used single nuclei RNA sequencing (snRNA-seq), fluorescent in situ hybridization (FISH), and immunohistochemistry (IHC) to characterize cell type-specific patterns of CTR expression in the nucleus accumbens (NAc), a brain region that plays a critical role in voluntary drug taking. Using these approaches, we identified CTRs expressed on D1R- and D2R-expressing medium spiny neurons (MSNs) in the medial shell subregion of the NAc. Interestingly, Calcr transcripts were expressed at higher levels in D2R- versus D1R-expressing MSNs. Cre-dependent viral-mediated miRNA knockdown of CTRs in transgenic male rats was then used to determine the functional significance of endogenous CTR signaling in opioid taking. We discovered that reduced CTR expression specifically in D1R-expressing MSNs potentiated/augmented opioid self-administration. In contrast, reduced CTR expression specifically in D2R-expressing MSNs attenuated opioid self-administration. These findings highlight a novel cell type-specific mechanism by which CTR signaling in the ventral striatum bidirectionally modulates voluntary opioid taking and support future studies aimed at targeting central CTR-expressing circuits to treat OUD.
Affiliation(s)
- Yafang Zhang
  - Department of Biobehavioral Health Sciences, School of Nursing, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Jennifer Ben Nathan
  - Department of Biobehavioral Health Sciences, School of Nursing, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Amanda Moreno
  - Department of Biobehavioral Health Sciences, School of Nursing, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Riley Merkel
  - Department of Biobehavioral Health Sciences, School of Nursing, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Michelle W Kahng
  - Department of Biobehavioral Health Sciences, School of Nursing, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Matthew R Hayes
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Benjamin C Reiner
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Richard C Crist
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Heath D Schmidt
  - Department of Biobehavioral Health Sciences, School of Nursing, University of Pennsylvania, Philadelphia, PA, 19104, USA.
  - Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.
|
44
|
Tan KT, Slevin MK, Leibowitz ML, Garrity-Janger M, Li H, Meyerson M. Neotelomeres and Telomere-Spanning Chromosomal Arm Fusions in Cancer Genomes Revealed by Long-Read Sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.30.569101. [PMID: 38077026 PMCID: PMC10705422 DOI: 10.1101/2023.11.30.569101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Alterations in the structure and location of telomeres are key events in cancer genome evolution. However, previous genomic approaches, unable to span long telomeric repeat arrays, could not characterize the nature of these alterations. Here, we applied both long-read and short-read genome sequencing to assess telomere repeat-containing structures in cancers and cancer cell lines. Using long-read genome sequences that span telomeric repeat arrays, we defined four types of telomere repeat variations in cancer cells: neotelomeres where telomere addition heals chromosome breaks, chromosomal arm fusions spanning telomere repeats, fusions of neotelomeres, and peri-centromeric fusions with adjoined telomere and centromere repeats. Analysis of lung adenocarcinoma genome sequences identified somatic neotelomere and telomere-spanning fusion alterations. These results provide a framework for the systematic study of telomeric repeat arrays in cancer genomes that could serve as a model for understanding the somatic evolution of other repetitive genomic elements.
Affiliation(s)
- Kar-Tong Tan
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Michael K. Slevin
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Center for Cancer Genomics, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Mitchell L. Leibowitz
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Max Garrity-Janger
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, USA
- Matthew Meyerson
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
  - Center for Cancer Genomics, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Lead contact
|
45
|
Zheng W, Min W, Wang S. TsImpute: an accurate two-step imputation method for single-cell RNA-seq data. Bioinformatics 2023; 39:btad731. [PMID: 38039139 PMCID: PMC10724850 DOI: 10.1093/bioinformatics/btad731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 11/22/2023] [Accepted: 11/30/2023] [Indexed: 12/03/2023] Open
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) technology has enabled discovering gene expression patterns at single cell resolution. However, due to technical limitations, there are usually excessive zeros, called "dropouts," in scRNA-seq data, which may mislead the downstream analysis. Therefore, it is crucial to impute these dropouts to recover the biological information. RESULTS We propose a two-step imputation method called tsImpute to impute scRNA-seq data. At the first step, tsImpute adopts zero-inflated negative binomial distribution to discriminate dropouts from true zeros and performs initial imputation by calculating the expected expression level. At the second step, it conducts clustering with this modified expression matrix, based on which the final distance weighted imputation is performed. Numerical results based on both simulated and real data show that tsImpute achieves favorable performance in terms of gene expression recovery, cell clustering, and differential expression analysis. AVAILABILITY AND IMPLEMENTATION The R package of tsImpute is available at https://github.com/ZhengWeihuaYNU/tsImpute.
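The first step described in this abstract, separating dropouts from true zeros with a zero-inflated negative binomial (ZINB) model, can be illustrated with the standard ZINB zero probability. The sketch below is not the tsImpute implementation: the parameter names (`pi` for the zero-inflation weight, `mu` for the mean, `theta` for the dispersion), the example values, and the final "posterior times mean" imputed value are assumptions made for illustration.

```python
def nb_zero_prob(mu, theta):
    """P(X = 0) under a negative binomial with mean mu and dispersion (size) theta."""
    return (theta / (theta + mu)) ** theta


def dropout_posterior(pi, mu, theta):
    """P(an observed zero is a dropout) under a zero-inflated negative binomial."""
    p_zero = pi + (1.0 - pi) * nb_zero_prob(mu, theta)
    return pi / p_zero


def expected_value_for_zero(pi, mu, theta):
    """Illustrative imputed value: dropout posterior times the NB mean (an assumption)."""
    return dropout_posterior(pi, mu, theta) * mu


# Hypothetical parameters: the same zero-inflation weight for two genes,
# one highly expressed (mu = 4.0) and one lowly expressed (mu = 0.1).
high = dropout_posterior(pi=0.3, mu=4.0, theta=1.0)
low = dropout_posterior(pi=0.3, mu=0.1, theta=1.0)
imputed_high = expected_value_for_zero(pi=0.3, mu=4.0, theta=1.0)
```

Under these assumed parameters, a zero in the highly expressed gene is more likely to be a dropout than a zero in the lowly expressed gene, because the negative binomial component rarely produces a zero when the mean is large.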
Affiliation(s)
- Weihua Zheng
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
| | - Wenwen Min
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
- Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming 650504, China
| | - Shunfang Wang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
- Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming 650504, China
| |
|
46
|
Wang H, Zhang C, Hong SH, Maye P, Rowe D, Shin DG. CGCom: a framework for inferring Cell-cell Communication based on Graph Neural Network. bioRxiv 2023:2023.11.10.566642. [PMID: 38014057 PMCID: PMC10680670 DOI: 10.1101/2023.11.10.566642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Cell-cell communication is crucial in maintaining cellular homeostasis, cell survival and various regulatory relationships among interacting cells. Thanks to recent advances in spatial transcriptomics technologies, we can now explore if and how the cell proximity information available in spatial transcriptomics datasets can be used to infer cell-cell communication. Here we present a cell-cell communication inference framework, called CGCom, which uses a graph neural network (GNN) to learn communication patterns among interacting cells by combining single-cell spatial transcriptomic datasets with publicly available ligand-receptor information and the molecular regulatory information downstream of ligand-receptor signaling. To evaluate the performance of CGCom, we applied it to mouse embryo seqFISH datasets. Our results demonstrate that CGCom can not only accurately infer cell communication between individual cell pairs but also generalize its learning to predict communication between different cell types. We compared the performance of CGCom with two existing methods, CellChat and CellPhoneDB, and our comparative study revealed both common and unique communication patterns from the three approaches. Commonly found communication patterns include three sets of ligand-receptor communication relationships, one between surface ectoderm cells and spinal cord cells, one between gut tube cells and endothelium, and one between neural crest and endothelium, all of which have already been reported in the literature, thus lending credibility to all three methods. However, we hypothesize that CGCom is superior in reducing false positives thanks to its use of cell proximity information and its learning between specific cell pairs rather than between cell types. CGCom is a GNN-based solution that can take advantage of spatially resolved single-cell transcriptomic data to predict cell-cell communication with higher accuracy.
|
47
|
Liu Y, Sapoval N, Gallego-García P, Tomás L, Posada D, Treangen TJ, Stadler LB. Crykey: Rapid Identification of SARS-CoV-2 Cryptic Mutations in Wastewater. medRxiv 2023:2023.06.16.23291524. [PMID: 37986916 PMCID: PMC10659477 DOI: 10.1101/2023.06.16.23291524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
We present Crykey, a computational tool for rapidly identifying cryptic mutations of SARS-CoV-2. Specifically, we identify co-occurring single nucleotide mutations on the same sequencing read, called linked-read mutations, that are rare or entirely missing in existing databases and have the potential to represent novel cryptic lineages found in wastewater. While previous approaches exist for identifying cryptic linked-read mutations from specific regions of the SARS-CoV-2 genome, there is a need for computational tools capable of efficiently tracking cryptic mutations across the entire genome, for tens of thousands of samples, and with increased scrutiny, given their potential to represent either artifacts or hidden SARS-CoV-2 lineages. Crykey fills this gap by identifying rare linked-read mutations that pass stringent computational filters to limit the potential for artifacts. We evaluate the utility of Crykey on >3,000 wastewater and >22,000 clinical samples; our findings are three-fold: i) we identify hundreds of cryptic mutations that cover the entire SARS-CoV-2 genome, ii) we track the presence of these cryptic mutations across multiple wastewater treatment plants and over three years of sampling in Houston, and iii) we find that a handful of cryptic mutations in wastewater mirror cryptic mutations in clinical samples and investigate their potential to represent real cryptic lineages. In summary, Crykey enables large-scale detection of cryptic mutations representing potential cryptic lineages in wastewater.
Affiliation(s)
- Yunxi Liu
- Department of Computer Science, Rice University, Houston, TX, 77005, USA
| | - Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX, 77005, USA
| | - Pilar Gallego-García
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO
| | - Laura Tomás
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO
| | - David Posada
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO
- Department of Biochemistry, Genetics, and Immunology, Universidade de Vigo, 36310 Vigo, Spain
| | - Todd J. Treangen
- Department of Computer Science, Rice University, Houston, TX, 77005, USA
| | - Lauren B. Stadler
- Department of Civil and Environmental Engineering, Rice University, Houston, TX, 77005, USA
| |
|
48
|
Leduc A, Harens H, Slavov N. Modeling and interpretation of single-cell proteogenomic data. arXiv 2023:arXiv:2308.07465v2. [PMID: 37645043 PMCID: PMC10462161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Biological functions stem from coordinated interactions among proteins, nucleic acids and small molecules. Mass spectrometry technologies for reliable, high-throughput single-cell proteomics will add a new modality to genomics and enable data-driven modeling of the molecular mechanisms coordinating proteins and nucleic acids at single-cell resolution. Realizing this potential requires estimating the reliability of measurements and computational analysis so that models can distinguish biological regulation from technical artifacts. We highlight different measurement modes that can support single-cell proteogenomic analysis and how to estimate their reliability. We then discuss approaches for developing both abstract and mechanistic models that aim to biologically interpret the measured differences across modalities, including specific applications to directed stem cell differentiation and to inferring protein interactions in cancer cells from the buffering of DNA copy-number variations. Single-cell proteogenomic data will support mechanistic models of direct molecular interactions that will provide generalizable and predictive representations of biological systems.
Affiliation(s)
- Andrew Leduc
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single Cell Proteomics Center, and Barnett Institute, Northeastern University, Boston, MA 02115, USA
| | - Hannah Harens
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single Cell Proteomics Center, and Barnett Institute, Northeastern University, Boston, MA 02115, USA
| | - Nikolai Slavov
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single Cell Proteomics Center, and Barnett Institute, Northeastern University, Boston, MA 02115, USA
- Parallel Squared Technology Institute, Watertown, MA 02472, USA
| |
|
49
|
Baharav TZ, Tse D, Salzman J. OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery. bioRxiv 2023:2023.03.16.533008. [PMID: 37961606 PMCID: PMC10634974 DOI: 10.1101/2023.03.16.533008] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 11/15/2023]
Abstract
Contingency tables, data represented as counts matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference (1), we develop OASIS (Optimized Adaptive Statistic for Inferring Structure), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form p-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's p-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. The same method based on OASIS significance calls detects SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which cannot be achieved with current approaches. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS still provides good control of the false discovery rate, while Pearson's X² test consistently rejects the null. Additionally, we show on synthetic data that OASIS is more powerful than Pearson's X² test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
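For orientation, the classical baseline that the abstract compares against, Pearson's X² statistic for a contingency table, can be written in a few lines of Python. This sketch shows only the textbook statistic, not OASIS itself, whose exact construction is not given in the abstract. The function name is hypothetical, and the code assumes every row and column total is positive.

```python
import numpy as np

def pearson_chi2(table):
    """Return Pearson's X^2 statistic and its degrees of freedom.

    Expected counts under row/column independence are
    E[i, j] = row_total[i] * col_total[j] / grand_total.
    Assumes all row and column totals are positive.
    """
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)   # shape (r, 1)
    col = table.sum(axis=0, keepdims=True)   # shape (1, c)
    expected = row @ col / table.sum()       # outer product of the margins
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof

# Example: a 2 x 2 table whose rows have opposite column preferences.
stat, dof = pearson_chi2([[10, 20], [20, 10]])
```

Unlike the linear statistic described in the abstract, X² is a sum of squared, variance-scaled deviations from the expected counts.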
Affiliation(s)
- Tavor Z. Baharav
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305
| | - David Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305
| | - Julia Salzman
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305
- Department of Biochemistry, Stanford University, Stanford, CA 94305
- Department of Statistics (by courtesy), Stanford University, Stanford, CA 94305
| |
|
50
|
Kim D, Tran A, Kim HJ, Lin Y, Yang JYH, Yang P. Gene regulatory network reconstruction: harnessing the power of single-cell multi-omic data. NPJ Syst Biol Appl 2023; 9:51. [PMID: 37857632 PMCID: PMC10587078 DOI: 10.1038/s41540-023-00312-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 10/02/2023] [Indexed: 10/21/2023] Open
Abstract
Inferring gene regulatory networks (GRNs) is a fundamental challenge in biology that aims to unravel the complex relationships between genes and their regulators. Deciphering these networks plays a critical role in understanding the underlying regulatory crosstalk that drives many cellular processes and diseases. Recent advances in sequencing technology have led to the development of state-of-the-art GRN inference methods that exploit matched single-cell multi-omic data. By employing diverse mathematical and statistical methodologies, these methods aim to reconstruct more comprehensive and precise gene regulatory networks. In this review, we give a brief overview of the statistical and methodological foundations commonly used in GRN inference methods. We then compare and contrast the latest state-of-the-art GRN inference methods for single-cell matched multi-omics data and discuss their assumptions, limitations and opportunities. Finally, we discuss the challenges and future directions that hold promise for further advances in this rapidly developing field.
Affiliation(s)
- Daniel Kim
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
- Computational Systems Biology Unit, Children's Medical Research Institute, University of Sydney, Camperdown, NSW, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
| | - Andy Tran
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
| | - Hani Jieun Kim
- Computational Systems Biology Unit, Children's Medical Research Institute, University of Sydney, Camperdown, NSW, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
| | - Yingxin Lin
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
| | - Jean Yee Hwa Yang
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
| | - Pengyi Yang
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
- Computational Systems Biology Unit, Children's Medical Research Institute, University of Sydney, Camperdown, NSW, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, Australia
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, Australia
| |
|