1
|
Lee SX, McLachlan GJ, Leemaqz KL. Multi‐node Expectation–Maximization algorithm for finite mixture models. Stat Anal Data Min 2021. [DOI: 10.1002/sam.11529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Sharon X. Lee
  - School of Mathematical Sciences, University of Adelaide, Adelaide, South Australia, Australia
- Kaleb L. Leemaqz
  - UNSW Business School, University of New South Wales, Sydney, New South Wales, Australia
|
2
|
JSOM: Jointly-evolving self-organizing maps for alignment of biological datasets and identification of related clusters. PLoS Comput Biol 2021; 17:e1008804. [PMID: 33724985 PMCID: PMC7963045 DOI: 10.1371/journal.pcbi.1008804] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/03/2020] [Accepted: 02/15/2021] [Indexed: 11/19/2022]
Abstract
With the rapid advances of various single-cell technologies, an increasing number of single-cell datasets are being generated, and computational tools for aligning the datasets, which make subsequent integration or meta-analysis possible, have become critical. Typically, single-cell datasets from different technologies cannot be directly combined or concatenated, due to innate differences in the data, such as the number of measured parameters and the distributions. Even datasets generated by the same technology are often affected by the batch effect. A computational approach for aligning different datasets and hence identifying related clusters will be useful for data integration and interpretation in large-scale single-cell experiments. Our proposed algorithm, JSOM, a variation of the self-organizing map, aligns two related datasets that contain similar clusters by constructing two maps (low-dimensional discretized representations of the datasets) that jointly evolve according to both datasets. Here we applied the JSOM algorithm to flow cytometry, mass cytometry, and single-cell RNA sequencing datasets. The resulting JSOM maps not only align the related clusters in the two datasets but also preserve the topology of the datasets so that the maps could be used for further analysis, such as clustering.
Biological datasets are now generated more than ever as many data acquisition technologies have been developed over the years, especially single-cell technologies. With an increasing number of datasets available for larger-scale studies, robust computational tools that can align datasets are needed for data integration and interpretation. We present a new algorithm that can align two biological datasets and demonstrate that the algorithm can work with data generated from different data acquisition technologies. Our proposed algorithm produces low-dimensional representations of two datasets to align them in a way that preserves the topology of the respective datasets. Such aligned maps facilitate further analysis, such as clustering. The proposed algorithm showed promising results when applied to different combinations of datasets, i.e., flow cytometry to flow cytometry, flow cytometry to mass cytometry, and two different single-cell RNA sequencing technologies. Therefore, our newly developed algorithm could potentially lead to new discoveries that were once difficult to obtain.
|
3
|
|
4
|
Ji D, Putzel P, Qian Y, Chang I, Mandava A, Scheuermann RH, Bui JD, Wang H, Smyth P. Machine Learning of Discriminative Gate Locations for Clinical Diagnosis. Cytometry A 2020; 97:296-307. [PMID: 31691488 PMCID: PMC7079150 DOI: 10.1002/cyto.a.23906] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/18/2019] [Revised: 08/22/2019] [Accepted: 09/18/2019] [Indexed: 01/03/2023]
Abstract
High-throughput single-cell cytometry technologies have significantly improved our understanding of cellular phenotypes to support translational research and the clinical diagnosis of hematological and immunological diseases. However, subjective and ad hoc manual gating analysis does not adequately handle the increasing volume and heterogeneity of cytometry data for optimal diagnosis. Prior work has shown that machine learning can be applied to classify cytometry samples effectively. However, many of the machine learning classification results are either difficult to interpret without using characteristics of cell populations to make the classification, or suboptimal due to the use of inaccurate cell population characteristics derived from gating boundaries. To date, little has been done to optimize both the gating boundaries and the diagnostic accuracy simultaneously. In this work, we describe a fully discriminative machine learning approach that can simultaneously learn feature representations (e.g., combinations of coordinates of gating boundaries) and classifier parameters for optimizing clinical diagnosis from cytometry measurements. The approach starts from an initial gating position and then refines the position of the gating boundaries by gradient descent until a set of globally optimized gates across different samples is achieved. The learning procedure is constrained by regularization terms encoding domain knowledge that encourage the algorithm to seek interpretable results. We evaluate the proposed approach using both simulated and real data, producing classification results on par with those generated via human expertise, in terms of both the positions of the gating boundaries and the diagnostic accuracy. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.
Affiliation(s)
- Disi Ji
  - Department of Computer Science, University of California, Irvine, California
- Preston Putzel
  - Department of Computer Science, University of California, Irvine, California
- Yu Qian
  - Informatics, J. Craig Venter Institute, La Jolla, California
- Ivan Chang
  - Informatics, J. Craig Venter Institute, La Jolla, California
- Richard H. Scheuermann
  - Informatics, J. Craig Venter Institute, La Jolla, California
  - Department of Pathology, University of California San Diego, La Jolla, California
- Jack D. Bui
  - Department of Pathology, University of California San Diego, La Jolla, California
- Huan‐You Wang
  - Department of Pathology, University of California San Diego, La Jolla, California
- Padhraic Smyth
  - Department of Computer Science, University of California, Irvine, California
|
5
|
Qi Y, Fang Y, Sinclair DR, Guo S, Alberich-Jorda M, Lu J, Tenen DG, Kharas MG, Pyne S. High-speed automatic characterization of rare events in flow cytometric data. PLoS One 2020; 15:e0228651. [PMID: 32045462 PMCID: PMC7012421 DOI: 10.1371/journal.pone.0228651] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/07/2019] [Accepted: 01/21/2020] [Indexed: 11/19/2022]
Abstract
A new computational framework for FLow cytometric Analysis of Rare Events (FLARE) has been developed specifically for fast and automatic identification of rare cell populations in very large samples generated by platforms like multi-parametric flow cytometry. Using a hierarchical Bayesian model and information-sharing via parallel computation, FLARE rapidly explores the high-dimensional marker-space to detect highly rare populations that are consistent across multiple samples. Further, it can focus within specified regions of interest in marker-space to detect subpopulations with desired precision.
Affiliation(s)
- Yuan Qi
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States of America
  - Department of Statistics, Purdue University, West Lafayette, IN, United States of America
  - * E-mail: (YQ); (SP)
- Youhan Fang
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States of America
- David R. Sinclair
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
  - Public Health Dynamics Laboratory, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Department of Health Policy and Management, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, United States of America
- Shangqin Guo
  - Department of Cell Biology, Yale University School of Medicine, New Haven, CT, United States of America
- Jun Lu
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, United States of America
  - Yale Stem Cell Center, Yale University School of Medicine, New Haven, CT, United States of America
- Daniel G. Tenen
  - Center for Life Sciences, Harvard Medical School, Boston, MA, United States of America
  - Harvard Stem Cell Institute, Harvard Medical School, Boston, MA, United States of America
  - Cancer Science Institute, National University of Singapore, Singapore, Singapore
- Michael G. Kharas
  - Molecular Pharmacology Program, Memorial Sloan Kettering Cancer Center, New York, NY, United States of America
- Saumyadipta Pyne
  - Public Health Dynamics Laboratory, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, United States of America
  - * E-mail: (YQ); (SP)
|
6
|
Jain S, Levine M, Radivojac P, Trosset MW. Identifiability of two‐component skew normal mixtures with one known component. Scand Stat Theory Appl 2019. [DOI: 10.1111/sjos.12377] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/16/2023]
Affiliation(s)
- Shantanu Jain
  - Khoury College of Computer and Information Sciences, Northeastern University, Boston, Massachusetts
- Michael Levine
  - Department of Statistics, Purdue University, West Lafayette, Indiana
- Predrag Radivojac
  - Khoury College of Computer and Information Sciences, Northeastern University, Boston, Massachusetts
|
7
|
Lee SX, Leemaqz KL, McLachlan GJ. A Block EM Algorithm for Multivariate Skew Normal and Skew t-Mixture Models. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:5581-5591. [PMID: 29993871 DOI: 10.1109/tnnls.2018.2805317] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
Finite mixtures of skew distributions provide a flexible tool for modeling heterogeneous data with asymmetric distributional features. However, parameter estimation via the Expectation-Maximization (EM) algorithm can become very time consuming due to the complicated expressions involved in the E-step that are numerically expensive to evaluate. While parallelizing the EM algorithm can offer considerable speedup in time performance, current implementations focus almost exclusively on distributed platforms. In this paper, we consider instead the most typical operating environment for users of mixture models: a standalone multicore machine and the R programming environment. We develop a block implementation of the EM algorithm that allows the calculations in the E- and M-steps to be spread across a number of threads. We focus on the fitting of finite mixtures of multivariate skew normal and skew t-distributions, and show that both the E- and M-steps in the EM algorithm can be modified to allow the data to be split into blocks. Our approach is easy to implement and provides immediate benefits to users of multicore machines. Experiments were conducted on two real data sets to demonstrate the effectiveness of the proposed approach.
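The block idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a univariate two-component Gaussian mixture for the skew normal and skew t models, uses Python rather than R, and runs the blocks sequentially where a real implementation would hand them to threads. The function names (`block_stats`, `m_step`, `em_iteration`) are hypothetical. The point shown is only that the E-step reduces each block to additive sufficient statistics, so the M-step on the summed statistics should match the M-step on the undivided data.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def block_stats(block, weights, means, variances):
    """E-step on one block: per-component sums of r, r*x and r*x^2,
    where r is the posterior responsibility of the component."""
    k = len(weights)
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    for x in block:
        dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(dens)
        for j in range(k):
            r = dens[j] / total
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return s0, s1, s2


def m_step(stats_list):
    """M-step: add the block statistics together, then update the parameters."""
    k = len(stats_list[0][0])
    s0 = [sum(s[0][j] for s in stats_list) for j in range(k)]
    s1 = [sum(s[1][j] for s in stats_list) for j in range(k)]
    s2 = [sum(s[2][j] for s in stats_list) for j in range(k)]
    n = sum(s0)
    weights = [s0[j] / n for j in range(k)]
    means = [s1[j] / s0[j] for j in range(k)]
    variances = [s2[j] / s0[j] - means[j] ** 2 for j in range(k)]
    return weights, means, variances


def em_iteration(data, params, n_blocks):
    """One EM iteration with the data split into n_blocks contiguous blocks.
    The list comprehension over blocks is the part that could run in parallel."""
    size = math.ceil(len(data) / n_blocks)
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    stats = [block_stats(b, *params) for b in blocks]
    return m_step(stats)


# Synthetic data and starting values (assumptions for illustration only).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(300)] + [random.gauss(5.0, 1.0) for _ in range(200)]
start = ([0.5, 0.5], [-1.0, 6.0], [1.0, 1.0])
single = em_iteration(data, start, n_blocks=1)
blocked = em_iteration(data, start, n_blocks=4)
```

Comparing `single` with `blocked` shows the two updates agreeing up to floating-point rounding, which is what makes the block split safe to parallelize in this simplified setting.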
|
8
|
Orlova DY, Meehan S, Parks D, Moore WA, Meehan C, Zhao Q, Ghosn EEB, Herzenberg LA, Walther G. QFMatch: multidimensional flow and mass cytometry samples alignment. Sci Rep 2018; 8:3291. [PMID: 29459702 PMCID: PMC5818510 DOI: 10.1038/s41598-018-21444-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 06/08/2017] [Accepted: 02/05/2018] [Indexed: 12/15/2022]
Abstract
Part of the flow/mass cytometry data analysis process is aligning (matching) cell subsets between relevant samples. Current methods address this cluster-matching problem in ways that are either computationally expensive, affected by the curse of dimensionality, or fail when population patterns significantly vary between samples. Here, we introduce a quadratic form (QF)-based cluster matching algorithm (QFMatch) that is computationally efficient and accommodates cases where population locations differ significantly (or even disappear or appear) from sample to sample. We demonstrate the effectiveness of QFMatch by evaluating sample datasets from immunology studies. The algorithm is based on a novel multivariate extension of the quadratic form distance for the comparison of flow cytometry data sets. We show that this QF distance has attractive computational and statistical properties that make it well suited for analysis tasks that involve the comparison of flow/mass cytometry samples.
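For orientation, the classical quadratic form distance between two histograms can be sketched as follows. This is the textbook form with similarity weights a_ij = 1 - d_ij / d_max between bin centers, not the multivariate extension introduced in the paper, whose details are not given in the abstract. The function name and the toy bins are assumptions.

```python
import math


def quadratic_form_distance(h, f, centers):
    """Classical quadratic form distance sqrt((h - f)^T A (h - f)) between two
    histograms defined over the same bins, with A_ij = 1 - d_ij / d_max and
    d_ij the Euclidean distance between bin centers i and j.
    Assumes at least two distinct bin centers, so that d_max > 0."""
    n = len(centers)
    d = [[math.dist(centers[i], centers[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d)
    z = [hi - fi for hi, fi in zip(h, f)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += z[i] * (1.0 - d[i][j] / d_max) * z[j]
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(total, 0.0))


# Toy example: four bins on the corners of the unit square.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
reference = [0.7, 0.1, 0.1, 0.1]
shifted_near = [0.1, 0.7, 0.1, 0.1]  # mass moved to an adjacent bin
shifted_far = [0.1, 0.1, 0.1, 0.7]   # mass moved to the opposite corner
```

Because the weights account for how close bins are to each other, moving mass to a neighbouring bin yields a smaller distance than moving it to a distant bin, which a bin-by-bin comparison would not distinguish.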
Affiliation(s)
- Darya Y Orlova
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Stephen Meehan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- David Parks
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Wayne A Moore
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Connor Meehan
  - Department of Mathematics, California Institute of Technology, Pasadena, CA, USA
- Qian Zhao
  - Department of Statistics, Stanford University, Stanford, CA, USA
- Eliver E B Ghosn
  - Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
- Leonore A Herzenberg
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Guenther Walther
  - Department of Statistics, Stanford University, Stanford, CA, USA
|
9
|
Azad A, Rajwa B, Pothen A. Immunophenotype Discovery, Hierarchical Organization, and Template-Based Classification of Flow Cytometry Samples. Front Oncol 2016; 6:188. [PMID: 27630823 PMCID: PMC5005935 DOI: 10.3389/fonc.2016.00188] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 01/29/2016] [Accepted: 08/08/2016] [Indexed: 01/22/2023]
Abstract
We describe algorithms for discovering immunophenotypes from large collections of flow cytometry samples and using them to organize the samples into a hierarchy based on phenotypic similarity. The hierarchical organization is helpful for effective and robust cytometry data mining, including the creation of collections of cell populations characteristic of different classes of samples, robust classification, and anomaly detection. We summarize a set of samples belonging to a biological class or category with a statistically derived template for the class. Whereas individual samples are represented in terms of their cell populations (clusters), a template consists of generic meta-populations (a group of homogeneous cell populations obtained from the samples in a class) that describe key phenotypes shared among all those samples. We organize an FC data collection in a hierarchical data structure that supports the identification of immunophenotypes relevant to clinical diagnosis. A robust template-based classification scheme is also developed, but our primary focus is on the discovery of phenotypic signatures and inter-sample relationships in an FC data collection. This collective analysis approach is more efficient and robust since templates describe phenotypic signatures common to cell populations in several samples while ignoring noise and small sample-specific variations. We have applied the template-based scheme to analyze several datasets, including one representing a healthy immune system and one of acute myeloid leukemia (AML) samples. The last task is challenging due to the phenotypic heterogeneity of the several subtypes of AML. However, we identified thirteen immunophenotypes corresponding to subtypes of AML and were able to distinguish acute promyelocytic leukemia (APL) samples with the markers provided. Clinically, this is helpful since APL has a different treatment regimen from other subtypes of AML.
Core algorithms used in our data analysis are available in the flowMatch package at www.bioconductor.org. It has been downloaded nearly 6,000 times since 2014.
Affiliation(s)
- Ariful Azad
  - Lawrence Berkeley National Laboratory, Computational Research Division, Berkeley, CA, USA
- Bartek Rajwa
  - Bindley Bioscience Center, Purdue University, West Lafayette, IN, USA
- Alex Pothen
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
|
10
|
Azad A, Rajwa B, Pothen A. flowVS: channel-specific variance stabilization in flow cytometry. BMC Bioinformatics 2016; 17:291. [PMID: 27465477 PMCID: PMC4964071 DOI: 10.1186/s12859-016-1083-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 08/06/2015] [Accepted: 05/14/2016] [Indexed: 01/21/2023]
Abstract
Background: Comparing phenotypes of heterogeneous cell populations from multiple biological conditions is at the heart of scientific discovery based on flow cytometry (FC). When the biological signal is measured by the average expression of a biomarker, standard statistical methods require that variance be approximately stabilized in populations to be compared. Since the mean and variance of a cell population are often correlated in fluorescence-based FC measurements, a preprocessing step is needed to stabilize the within-population variances.
Results: We present a variance-stabilization algorithm, called flowVS, that removes the mean-variance correlations from cell populations identified in each fluorescence channel. flowVS transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett’s likelihood-ratio test so that the populations attain homogeneous variances. The optimum parameters are then used to transform the corresponding channels in every sample. flowVS is therefore an explicit variance-stabilization method that stabilizes within-population variances in each channel by evaluating the homoskedasticity of clusters with a likelihood-ratio test. With two publicly available datasets, we show that flowVS removes the mean-variance dependence from raw FC data and makes the within-population variance relatively homogeneous. We demonstrate that alternative transformation techniques such as flowTrans, flowScape, logicle, and FCSTrans might not stabilize variance. Besides flow cytometry, flowVS can also be applied to stabilize variance in microarray data. With a publicly available data set we demonstrate that flowVS performs as well as the VSN software, a state-of-the-art approach developed for microarrays.
Conclusions: The homogeneity of variance in cell populations across FC samples is desirable when extracting features uniformly and comparing cell populations with different levels of marker expressions. The newly developed flowVS algorithm solves the variance-stabilization problem in FC and microarrays by optimally transforming data with the help of Bartlett’s likelihood-ratio test. On two publicly available FC datasets, flowVS stabilizes within-population variances more evenly than the available transformation and normalization techniques. flowVS-based variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples. flowVS and the datasets used in this paper are publicly available in Bioconductor.
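The selection step described in this abstract can be sketched roughly as below. This is not the flowVS package (which is an R/Bioconductor package) and does not reproduce its API. It assumes a single-parameter transform asinh(x / c), assumes the cell populations in a channel are already identified, and picks c from a fixed candidate grid by minimizing Bartlett's statistic. The names `bartlett_statistic`, `asinh_transform` and `choose_cofactor`, the grid, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def bartlett_statistic(groups):
    """Bartlett's test statistic for equal variances across groups.
    Smaller values indicate more homogeneous variances."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]
    n_total = sum(sizes)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1.0 + (
        sum(1.0 / (n - 1) for n in sizes) - 1.0 / (n_total - k)
    ) / (3.0 * (k - 1))
    return numerator / correction


def asinh_transform(values, cofactor):
    """Assumed parameterization of the transform: asinh(x / cofactor)."""
    return [math.asinh(x / cofactor) for x in values]


def choose_cofactor(populations, candidates):
    """Return the candidate cofactor whose transformed populations have the
    smallest Bartlett statistic, i.e. the most homogeneous variances."""
    def score(c):
        return bartlett_statistic([asinh_transform(p, c) for p in populations])
    return min(candidates, key=score)


# Synthetic channel with three populations whose spread grows with the mean.
random.seed(1)
populations = [
    [random.gauss(mean, 0.1 * mean) for _ in range(200)]
    for mean in (100.0, 1000.0, 10000.0)
]
candidates = [1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0]
best = choose_cofactor(populations, candidates)
```

A grid search is used here only to keep the sketch short; any one-dimensional optimizer over the cofactor would serve the same purpose. On data like this, where the standard deviation scales with the mean, the selected transform should give a markedly lower Bartlett statistic than the untransformed populations.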
Affiliation(s)
- Ariful Azad
  - Computational Research Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, 94720, CA, USA
- Bartek Rajwa
  - Bindley Bioscience Center, Purdue University, West Lafayette, 47907, IN, USA
- Alex Pothen
  - Department of Computer Science, Purdue University, West Lafayette, 47907, IN, USA
|
11
|
|