1
Chattopadhyay A, Slocum S, Haeffele BD, Vidal R, Geman D. Interpretable by Design: Learning Predictors by Composing Interpretable Queries. IEEE Trans Pattern Anal Mach Intell 2023; 45:7430-7443. [PMID: 36441893 DOI: 10.1109/tpami.2022.3225162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Abstract
There is a growing concern about typically opaque decision-making with high-performance machine learning algorithms. Providing an explanation of the reasoning process in domain-specific terms can be crucial for adoption in risk-sensitive domains such as healthcare. We argue that machine learning algorithms should be interpretable by design and that the language in which these interpretations are expressed should be domain- and task-dependent. Consequently, we base our model's prediction on a family of user-defined and task-specific binary functions of the data, each having a clear interpretation to the end-user. We then minimize the expected number of queries needed for accurate prediction on any given input. As the solution is generally intractable, following prior work, we choose the queries sequentially based on information gain. However, in contrast to previous work, we need not assume the queries are conditionally independent. Instead, we leverage a stochastic generative model (VAE) and an MCMC algorithm (Unadjusted Langevin) to select the most informative query about the input based on previous query-answers. This enables the online determination of a query chain of whatever depth is required to resolve prediction ambiguities. Finally, experiments on vision and NLP tasks demonstrate the efficacy of our approach and its superiority over post-hoc explanations.
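The sequential, information-gain-driven selection described in this abstract can be illustrated with a toy sketch. The paper estimates information gain with a VAE and unadjusted Langevin sampling; the code below replaces that with plain empirical counts over a tiny hand-made dataset, so it is an assumption-laden illustration rather than the authors' method. All names (`next_query`, the animal queries and labels) are hypothetical.

```python
import math
from collections import defaultdict

def entropy(counts):
    """Shannon entropy (bits) of a list of non-negative counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def conditional_label_entropy(samples, query):
    """Empirical H(label | answer to query) over (answers, label) samples."""
    groups = defaultdict(lambda: defaultdict(int))
    for answers, label in samples:
        groups[answers[query]][label] += 1
    n = len(samples)
    return sum(sum(g.values()) / n * entropy(list(g.values())) for g in groups.values())

def next_query(samples, history, queries):
    """Pick the unasked query with the largest information gain given the history.

    Maximizing information gain is the same as minimizing the conditional
    label entropy, because the entropy given the history alone is fixed.
    """
    consistent = [s for s in samples if all(s[0][q] == a for q, a in history.items())]
    remaining = [q for q in queries if q not in history]
    return min(remaining, key=lambda q: conditional_label_entropy(consistent, q))

# Hypothetical binary queries and labels, for illustration only.
queries = ["wings", "eggs", "fur"]
samples = [
    ({"wings": 1, "eggs": 1, "fur": 0}, "bird"),
    ({"wings": 0, "eggs": 0, "fur": 1}, "mammal"),
    ({"wings": 1, "eggs": 0, "fur": 1}, "mammal"),
    ({"wings": 0, "eggs": 1, "fur": 0}, "reptile"),
]

first = next_query(samples, {}, queries)
second = next_query(samples, {"eggs": 1}, queries)
```

Here the query chain is built online: each call conditions on the answers observed so far, so the depth of the chain depends on the input rather than being fixed in advance.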
2
Wang M, Barker PB, Cascella NG, Coughlin JM, Nestadt G, Nucifora FC, Sedlak TW, Kelly A, Younes L, Geman D, Palaniyappan L, Sawa A, Yang K. Longitudinal changes in brain metabolites in healthy controls and patients with first episode psychosis: a 7-Tesla MRS study. Mol Psychiatry 2023; 28:2018-2029. [PMID: 36732587 PMCID: PMC10394114 DOI: 10.1038/s41380-023-01969-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/14/2021] [Revised: 01/13/2023] [Accepted: 01/17/2023] [Indexed: 02/04/2023]
Abstract
Seven Tesla magnetic resonance spectroscopy (7T MRS) offers a precise measurement of metabolic levels in the human brain via a non-invasive approach. Studying longitudinal changes in brain metabolites could help evaluate the characteristics of disease over time. This approach may also shed light on how the age of study participants and duration of illness may influence these metabolites. This study used 7T MRS to investigate longitudinal patterns of brain metabolites in young adulthood in both healthy controls and patients. A four-year longitudinal cohort with 38 patients with first episode psychosis (onset within 2 years) and 48 healthy controls was used to examine 10 brain metabolites in 5 brain regions associated with the pathophysiology of psychosis in a comprehensive manner. Both patients and controls were found to have significant longitudinal reductions in glutamate in the anterior cingulate cortex (ACC). Only patients were found to have a significant decrease over time in γ-aminobutyric acid, N-acetyl aspartate, myo-inositol, total choline, and total creatine in the ACC. Together we highlight the ACC with dynamic changes in several metabolites in early-stage psychosis, in contrast to the other 4 brain regions that also are known to play roles in psychosis. Meanwhile, glutathione was uniquely found to have a near zero annual percentage change in both patients and controls in all 5 brain regions during a four-year follow-up in young adulthood. Given that a reduction of the glutathione in the ACC has been reported as a feature of treatment-refractory psychosis, this observation further supports the potential of glutathione as a biomarker for this subset of patients with psychosis.
Affiliation(s)
- Min Wang
  - Russell H Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Peter B Barker
  - Russell H Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - F. M. Kirby Research Center for Functional Brain Imaging, Kennedy Krieger Institute, Baltimore, MD, USA
- Nicola G Cascella
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Jennifer M Coughlin
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Gerald Nestadt
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Frederick C Nucifora
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Thomas W Sedlak
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Alexandra Kelly
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Laurent Younes
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
- Donald Geman
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
- Lena Palaniyappan
  - Robarts Research Institution, University of Western Ontario, London, ON, Canada
  - Department of Psychiatry, University of Western Ontario, London, ON, Canada
  - Douglas Mental Health University Institute, Department of Psychiatry, McGill University, Montreal, QC, Canada
- Akira Sawa
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - Department of Pharmacology and Molecular Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - Department of Mental Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
- Kun Yang
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
3
Ji L, Wang A, Sonthalia S, Naiman DQ, Younes L, Colantuoni C, Geman D. CellCover Defines Conserved Cell Types and Temporal Progression in scRNA-seq Data across Mammalian Neocortical Development. bioRxiv 2023:2023.04.06.535943. [PMID: 37383947 PMCID: PMC10299349 DOI: 10.1101/2023.04.06.535943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/30/2023]
Abstract
Accurate identification of cell classes across the tissues of living organisms is central in the analysis of growing atlases of single-cell RNA sequencing (scRNA-seq) data across biomedicine. Such analyses are often based on the existence of highly discriminating "marker genes" for specific cell classes which enables a deeper functional understanding of these classes as well as their identification in new, related datasets. Currently, marker genes are defined by methods that serially assess the level of differential expression (DE) of individual genes across landscapes of diverse cells. This serial approach has been extremely useful, but is limited because it ignores possible redundancy or complementarity across genes, that can only be captured by analyzing several genes at the same time. We wish to identify discriminating panels of genes. To efficiently explore the vast space of possible marker panels, leverage the large number of cells often sequenced, and overcome zero-inflation in scRNA-seq data, we propose viewing panel selection as a variation of the "minimal set-covering problem" in combinatorial optimization which can be solved with integer programming. In this formulation, the covering elements are genes, and the objects to be covered are cells of a particular class, where a cell is covered by a gene if that gene is expressed in that cell. Our method, CellCover, identifies a panel of marker genes in scRNA-seq data that covers one class of cells within a population. We apply this method to generate covering marker gene panels which characterize cells of the developing mouse neocortex as postmitotic neurons are generated from neural progenitor cells (NPCs). 
We show that CellCover captures cell class-specific signals distinct from those defined by DE methods and that CellCover's compact gene panels can be expanded to explore cell type-specific function. Transfer learning experiments exploring these covering panels across in vivo mouse, primate, and human scRNA-seq datasets demonstrate that CellCover identifies markers of conserved cell classes in neurogenesis, as well as markers of temporal progression in the molecular identity of these cell types across development of the mammalian neocortex. The gene covering panels we identify across cell types and developmental time can be freely explored in visualizations across all the public data we use in this report with NeMo Analytics [1] through https://nemoanalytics.org/p?l=CellCover . The code for CellCover is written in R with the Gurobi R interface and is available at [2].
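The covering idea in this abstract (genes are the covering elements, cells of one class are the objects to cover, and a gene covers a cell if it is expressed there) can be sketched in a few lines. CellCover itself solves a variation of set covering with integer programming in R and Gurobi; the sketch below instead uses the classic greedy heuristic in Python, purely as an illustration. The gene names, cell identifiers, and the function `greedy_cover` are hypothetical.

```python
def greedy_cover(expressed_in, cells):
    """Greedy heuristic for set covering.

    expressed_in maps each candidate gene to the set of cells expressing it.
    Returns the chosen panel and any cells that no candidate gene covers.
    """
    uncovered = set(cells)
    panel = []
    while uncovered:
        # Take the gene that covers the most still-uncovered cells.
        gene = max(expressed_in, key=lambda g: len(expressed_in[g] & uncovered))
        gained = expressed_in[gene] & uncovered
        if not gained:
            break  # remaining cells express none of the candidate genes
        panel.append(gene)
        uncovered -= gained
    return panel, uncovered

# Hypothetical toy data: six cells of one class, four candidate genes.
expressed_in = {
    "GeneA": {1, 2, 3},
    "GeneB": {3, 4},
    "GeneC": {4, 5, 6},
    "GeneD": {1, 6},
}
panel, uncovered = greedy_cover(expressed_in, range(1, 7))
```

The greedy heuristic does not guarantee a minimum-size panel, which is one reason an exact integer-programming formulation can be preferable when a solver is available.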
4
Omar M, Dinalankara W, Mulder L, Coady T, Zanettini C, Imada EL, Younes L, Geman D, Marchionni L. Using biological constraints to improve prediction in precision oncology. iScience 2023; 26:106108. [PMID: 36852282 PMCID: PMC9958363 DOI: 10.1016/j.isci.2023.106108] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/04/2022] [Revised: 12/20/2022] [Accepted: 01/28/2023] [Indexed: 02/05/2023]
Abstract
Many gene signatures have been developed by applying machine learning (ML) to omics profiles; however, their clinical utility is often hindered by limited interpretability and unstable performance. Here, we show the importance of embedding prior biological knowledge in the decision rules yielded by ML approaches to build robust classifiers. We tested this by applying different ML algorithms to gene expression data to predict three difficult cancer phenotypes: bladder cancer progression to muscle-invasive disease, response to neoadjuvant chemotherapy in triple-negative breast cancer, and prostate cancer metastatic progression. We developed two sets of classifiers: mechanistic, by restricting the training to features capturing specific biological mechanisms; and agnostic, in which the training did not use any a priori biological information. Mechanistic models had a similar or better testing performance than their agnostic counterparts, with enhanced interpretability. Our findings support the use of biological constraints to develop robust gene signatures with high translational potential.
Affiliation(s)
- Mohamed Omar
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Wikum Dinalankara
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Lotte Mulder
  - Technical University Delft, 2628 CD Delft, the Netherlands
- Tendai Coady
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Claudio Zanettini
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Eddie Luidy Imada
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Laurent Younes
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Donald Geman
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Luigi Marchionni
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
5
Ke Q, Dinalankara W, Younes L, Geman D, Marchionni L. Abstract 173: Efficient representations of tumor diversity with paired DNA-RNA aberrations. Cancer Res 2021. [DOI: 10.1158/1538-7445.am2021-173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
In this work we develop a framework which allows for a systematic analysis of joint DNA and putative downstream RNA effects in cancer data cohorts. Using the Reactome database, we extract gene pairs that are linked by known mechanistic connections. Such pairs, which we refer to as 'Source Target Pairs' or STPs, consist of a source gene for which we examine aberrant activity in the DNA profile, and a target gene that is affected by said source gene, for which we examine aberrant activity in the RNA profile.
Using TCGA data for six different cancer types (breast, colon, kidney, liver, lung and prostate), we use mutation and copy number variation information to compile DNA aberrant activity data. For the same cancer cohorts, we use RNASeq gene expression data to quantify RNA aberrant activity via the previous 'divergence' method we have developed. In the divergence framework, normal samples from the same cancer are used to estimate a normal range of expression for target genes of interest and deviation from the normal range is assumed to indicate aberrant activity which may result from upstream DNA aberrations. Then for a given sample, an STP can be represented as a binary variable, indicating presence or absence of joint DNA-RNA aberrant activity.
We utilize integer programming to discover a small set of such STPs for each cancer type such that every sample displays aberrant activity in at least one STP. We refer to these reduced STP configurations as 'minimal coverings' of that cancer. These configurations then allow for the quantification of heterogeneity for that cancer type, as well as for phenotypical groups of interest. This is made possible due to the fact that sample to sample variability can be compared via the entropy of the distribution of the minimal covering, where the small number of STPs in such a configuration makes the computation more tractable.
Our results reveal many known putative drivers of cancer and identify some novel genes of interest for further consideration. Comparison of heterogeneity across phenotypes of interest shows higher entropy in more pathological phenotypes, indicating increasing heterogeneity with severity of disease.
Citation Format: Qian Ke, Wikum Dinalankara, Laurent Younes, Donald Geman, Luigi Marchionni. Efficient representations of tumor diversity with paired DNA-RNA aberrations [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 173.
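The entropy comparison described in this abstract can be sketched as follows, assuming each sample is represented by a binary vector over the STPs of a minimal covering (1 for joint DNA-RNA aberrant activity, 0 otherwise). The function name `covering_entropy` and the two toy cohorts are hypothetical; the abstract does not specify the estimator, so the plain empirical (plug-in) Shannon entropy is used here as an assumption.

```python
import math
from collections import Counter

def covering_entropy(samples):
    """Empirical Shannon entropy (bits) of STP configurations across samples."""
    counts = Counter(tuple(s) for s in samples)
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical cohorts over a two-STP covering.
# Every sample is aberrant in at least one STP, as a covering requires.
homogeneous = [(1, 0), (1, 0), (1, 0), (1, 0)]
heterogeneous = [(1, 0), (0, 1), (1, 1), (1, 0)]

h_low = covering_entropy(homogeneous)
h_high = covering_entropy(heterogeneous)
```

With k STPs there are at most 2^k distinct configurations, which is why a small covering keeps the empirical distribution, and hence its entropy, tractable to estimate.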
Affiliation(s)
- Qian Ke
  - Johns Hopkins University, Baltimore, MD
6
Baloni P, Dinalankara W, Earls JC, Knijnenburg TA, Geman D, Marchionni L, Price ND. Identifying Personalized Metabolic Signatures in Breast Cancer. Metabolites 2020; 11:20. [PMID: 33396819 PMCID: PMC7823382 DOI: 10.3390/metabo11010020] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/16/2020] [Revised: 12/23/2020] [Accepted: 12/28/2020] [Indexed: 01/04/2023]
Abstract
Cancer cells are adept at reprogramming energy metabolism, and the precise manifestation of this metabolic reprogramming exhibits heterogeneity across individuals (and from cell to cell). In this study, we analyzed the metabolic differences between interpersonal heterogeneous cancer phenotypes. We used divergence analysis on gene expression data of 1156 breast normal and tumor samples from The Cancer Genome Atlas (TCGA) and integrated this information with a genome-scale reconstruction of human metabolism to generate personalized, context-specific metabolic networks. Using this approach, we classified the samples into four distinct groups based on their metabolic profiles. Enrichment analysis of the subsystems indicated that amino acid metabolism, fatty acid oxidation, citric acid cycle, androgen and estrogen metabolism, and reactive oxygen species (ROS) detoxification distinguished these four groups. Additionally, we developed a workflow to identify potential drugs that can selectively target genes associated with the reactions of interest. MG-132 (a proteasome inhibitor) and OSU-03012 (a celecoxib derivative) were the top-ranking drugs identified from our analysis and known to have anti-tumor activity. Our approach has the potential to provide mechanistic insights into cancer-specific metabolic dependencies, ultimately enabling the identification of potential drug targets for each patient independently, contributing to a rational personalized medicine approach.
Affiliation(s)
- Priyanka Baloni
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Wikum Dinalankara
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- John C. Earls
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Theo A. Knijnenburg
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Donald Geman
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21205, USA
- Luigi Marchionni
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Nathan D. Price
  - Institute for Systems Biology, Seattle, WA 98109, USA
7
Afsari B, Cope L, Gaykalova DA, Geman D, Puram S, Goff LA, Favorov A, Fertig EJ. Abstract 3399: Uncovering hidden sources of transcriptional dysregulation arising from inter- and intra-tumor heterogeneity. Cancer Res 2019. [DOI: 10.1158/1538-7445.am2019-3399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Introduction: This study develops an innovative computational framework, Expression Variation Analysis (EVA), to model transcriptional dysregulation in cancer. Heterogeneity poses a major challenge in translational research. For example, inter-tumor heterogeneity limits the biomarker discovery and intra-tumor heterogeneity enables therapeutic resistance. Moreover, in some cancers driver mutations are insufficient to account for the widespread transcriptional variation responsible for these outcomes. Thus, new computational tools to model transcriptional variation are essential.
Methods: EVA is a unified computational framework to model transcriptional variation in cancer. Briefly, EVA quantifies transcriptional heterogeneity for one set of samples or cells from one phenotype using the expected dissimilarity between pairs of expression profiles. U-statistics theory can then quantify the statistical significance of the difference in transcriptional heterogeneity between phenotypes.
Results: We apply EVA to perform a comprehensive characterization of transcriptional variation in head and neck squamous cell carcinoma (HNSCC). At a pathway level, transcriptional variation in HNSCC tumors is higher than in normal controls. Applying EVA to integrate ChIP-seq data with RNA-seq reveals that these pervasive transcriptional differences occur in enhancers. Similarly, applying EVA at a gene level to model splicing reveals more heterogeneity in transcript usage in tumor samples than in normal samples. HPV- HNSCC tumors are unique in having mutations in genes that regulate the splicing machinery, and the HPV- tumors with these alterations have a greater number of dysregulated splice variants than those without. Nonetheless, the EVA analysis identifies a similar number of alternative splice variants in HPV+ as in HPV- tumors, suggesting an alternative mechanism of transcriptional heterogeneity in HPV+ disease. Adapting EVA to single cell data demonstrates that increased fibroblast composition is associated with greater variation in immune pathway activity in HNSCC. Moreover, we observe greater transcriptional heterogeneity in HNSCC primary tumors than in lymph node metastases, consistent with a clonal outgrowth.
Conclusions: We demonstrate that the statistical framework from EVA enables differential heterogeneity analysis in HNSCC ranging from pathway dysregulation, splice variation, epigenetic regulation, and single cell analysis. This algorithm provides a critical framework to model the hidden multi-molecular mechanisms underlying the complex patient outcomes that are pervasive in cancer.
Citation Format: Bahman Afsari, Leslie Cope, Daria A. Gaykalova, Donald Geman, Sidharth Puram, Loyal A. Goff, Alexander Favorov, Elana Judith Fertig. Uncovering hidden sources of transcriptional dysregulation arising from inter- and intra-tumor heterogeneity [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 3399.
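The heterogeneity measure described in the Methods paragraph of this abstract, the expected dissimilarity between pairs of expression profiles within one phenotype, can be sketched as an average over all sample pairs (a U-statistic of order two). The abstract does not name the dissimilarity, so the rank-discordance measure below is an assumption chosen for illustration, as are the function names and toy profiles. The significance test based on U-statistics theory is not reproduced here.

```python
from itertools import combinations

def discordance(x, y):
    """Fraction of gene pairs whose relative order differs between two profiles."""
    pairs = list(combinations(range(len(x)), 2))
    flipped = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)
    return flipped / len(pairs)

def expected_dissimilarity(profiles, dissimilarity=discordance):
    """Average dissimilarity over all unordered pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical expression profiles over three genes.
normal = [[1, 2, 3], [1, 2, 3], [2, 3, 4]]   # same gene ordering in every sample
tumor = [[1, 2, 3], [3, 2, 1], [2, 1, 3]]    # orderings vary between samples

v_normal = expected_dissimilarity(normal)
v_tumor = expected_dissimilarity(tumor)
```

Passing a different `dissimilarity` callable lets the same averaging step be reused for other data types, in the spirit of the unified framework the abstract describes.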
Affiliation(s)
- Bahman Afsari
  - Johns Hopkins Sidney Kimmel Comp. Cancer Ctr., Baltimore, MD
- Leslie Cope
  - Johns Hopkins Sidney Kimmel Comp. Cancer Ctr., Baltimore, MD
8
Afsari B, Guo T, Considine M, Florea L, Kagohara LT, Stein-O'Brien GL, Kelley D, Flam E, Zambo KD, Ha PK, Geman D, Ochs MF, Califano JA, Gaykalova DA, Favorov AV, Fertig EJ. Splice Expression Variation Analysis (SEVA) for inter-tumor heterogeneity of gene isoform usage in cancer. Bioinformatics 2019; 34:1859-1867. [PMID: 29342249 DOI: 10.1093/bioinformatics/bty004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/09/2017] [Accepted: 01/10/2018] [Indexed: 12/22/2022]
Abstract
Motivation: Current bioinformatics methods to detect changes in gene isoform usage in distinct phenotypes compare the relative expected isoform usage in phenotypes. These statistics model differences in isoform usage in normal tissues, which have stable regulation of gene splicing. Pathological conditions, such as cancer, can have broken regulation of splicing that increases the heterogeneity of the expression of splice variants. Inferring events with such differential heterogeneity in gene isoform usage requires new statistical approaches.
Results: We introduce Splice Expression Variability Analysis (SEVA) to model increased heterogeneity of splice variant usage between conditions (e.g. tumor and normal samples). SEVA uses a rank-based multivariate statistic that compares the variability of junction expression profiles within one condition to the variability within another. Simulated data show that SEVA is unique in modeling heterogeneity of gene isoform usage, and benchmark SEVA's performance against EBSeq, DiffSplice and rMATS that model differential isoform usage instead of heterogeneity. We confirm the accuracy of SEVA in identifying known splice variants in head and neck cancer and perform cross-study validation of novel splice variants. A novel comparison of splice variant heterogeneity between subtypes of head and neck cancer demonstrated unanticipated similarity between the heterogeneity of gene isoform usage in HPV-positive and HPV-negative subtypes and anticipated increased heterogeneity among HPV-negative samples with mutations in genes that regulate the splice variant machinery. These results show that SEVA accurately models differential heterogeneity of gene isoform usage from RNA-seq data.
Availability and implementation: SEVA is implemented in the R/Bioconductor package GSReg.
Contact: bahman@jhu.edu or favorov@sensi.org or ejfertig@jhmi.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahman Afsari
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Theresa Guo
  - Department of Otolaryngology-Head and Neck Surgery
- Michael Considine
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Liliana Florea
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD 21205, USA
- Luciane T Kagohara
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Genevieve L Stein-O'Brien
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Dylan Kelley
  - Department of Otolaryngology-Head and Neck Surgery
- Emily Flam
  - Department of Otolaryngology-Head and Neck Surgery
- Patrick K Ha
  - Department of Otolaryngology-Head and Neck Surgery, University of California, San Francisco, CA 94158, USA
- Donald Geman
  - Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Michael F Ochs
  - Department of Mathematics & Statistics, The College of New Jersey, Ewing, NJ 08628, USA
- Joseph A Califano
  - Division of Otolaryngology, Department of Surgery, University of California, San Diego, CA 92093, USA
- Alexander V Favorov
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
  - Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, RAS, Moscow 119333, Russia
- Elana J Fertig
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
9
Lahouel K, Geman D, Younes L. Coarse-to-fine multiple testing strategies. Electron J Stat 2019. [DOI: 10.1214/19-ejs1536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
10
Slama P, Hoopmann MR, Moritz RL, Geman D. Robust determination of differential abundance in shotgun proteomics using nonparametric statistics. Mol Omics 2018; 14:424-436. [PMID: 30259924 PMCID: PMC6490964 DOI: 10.1039/c8mo00077h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Label-free shotgun mass spectrometry enables the detection of significant changes in protein abundance between different conditions. Due to often limited cohort sizes or replication, large ratios of potential protein markers to number of samples, as well as multiple null measurements pose important technical challenges to conventional parametric models. From a statistical perspective, a scenario similar to that of unlabeled proteomics is encountered in genomics when looking for differentially expressed genes. Still, the difficulty of detecting a large fraction of the true positives without a high false discovery rate is arguably greater in proteomics due to even smaller sample sizes and peptide-to-peptide variability in detectability. These constraints argue for nonparametric (or distribution-free) tests on normalized peptide values, thus minimizing the number of free parameters, as well as for measuring significance with permutation testing. We propose such a procedure with a class-based statistic, no parametric assumptions, and no parameters to select other than a nominal false discovery rate. Our method was tested on a new dataset which is available via ProteomeXchange with identifier PXD006447. The dataset was prepared using a standard proteolytic digest of a human protein mixture at 1.5-fold to 3-fold protein concentration changes and diluted into a constant background of yeast proteins. We demonstrate its superiority relative to other approaches in terms of the realized sensitivity and realized false discovery rates determined by ground truth, and recommend it for detecting differentially abundant proteins from MS data.
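The permutation-based significance assessment that this abstract argues for can be illustrated with a small exact permutation test. The paper's class-based statistic is not specified in the abstract, so the absolute difference in group means is used below purely as a stand-in, and the function name and intensity values are hypothetical.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference between two groups.

    The statistic (absolute difference in means) is an illustrative stand-in.
    All relabelings are enumerated, so this suits only small sample sizes.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def stat(a_idx):
        a = [pooled[i] for i in a_idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in a_idx]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(tuple(range(n_a)))
    splits = list(combinations(range(len(pooled)), n_a))
    # Count relabelings at least as extreme as the observed one (with float slack).
    extreme = sum(stat(s) >= observed - 1e-12 for s in splits)
    return extreme / len(splits)

# Hypothetical normalized peptide values for one protein in two conditions.
p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
```

With three samples per group there are only 20 relabelings, so the smallest attainable two-sided p-value is 0.1, which illustrates how limited replication constrains significance in this setting.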
Affiliation(s)
- Patrick Slama
  - Center for Imaging Science, Institute for Computational Medicine, Johns Hopkins University, USA
  - Independent Researcher, Paris, France
- Robert L. Moritz
  - Institute for Systems Biology, 401 Terry Avenue N, Seattle, WA, USA 98109
- Donald Geman
  - Center for Imaging Science, Institute for Computational Medicine, Johns Hopkins University, USA
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD, 21218
11
Afsari B, Guo T, Considine M, Kelley D, Flam E, Florea L, Ha P, Geman D, Ochs MF, Califano JA, Gaykalova DA, Favorov AV, Fertig EJ. Abstract 3577: Splice expression variation analysis (SEVA) for differential gene isoform usage in cancer. Cancer Res 2017. [DOI: 10.1158/1538-7445.am2017-3577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Alternative splicing events (ASE) are a significant component of expression alterations in cancer, and have been demonstrated to be critically important in the development of malignant phenotypes in a variety of tumors. These alternative gene isoforms alter cell-signaling networks and serve as a hidden source of tumor-driving alterations not identified in multi-omics analyses. Recent studies have demonstrated that reads from RNA-seq data can infer gene isoforms expressed in a single sample. Therefore, RNA-seq data of tumors offers the opportunity to systematically evaluate expressed gene isoforms and identify splicing events in cancer samples.
To characterize a cancer specific ASEs landscape, it is essential to perform differential splice variant expression analysis to identify isoform variants that are unique to tumor samples compared to normal tissue. In spite of the breadth of ASE algorithms, few have been validated in primary tumor samples. Current methods for differential splice variant analysis compare mean expression of gene isoforms in sample groups. Because these variants are tumor-specific, ASEs are expected to have more variable exon junction expression than normal samples. Therefore, current differential ASE analysis algorithms from RNA-seq may not account for heterogeneous gene isoform usage in tumors. To address this, we introduce Splice Expression Variability Analysis (SEVA) to detect differential splice variation usage in tumor and normal samples and accounts for tumor heterogeneity. This algorithm compares the degree of variability of junction expression profiles within a population of normal samples relative to that in tumor samples.
The performance of SEVA was compared with two existing algorithms, EBSeq and DiffSplice, in simulated and real RNA-seq data. Simulated data suggest that SEVA is robust and computationally efficient relative to EBSeq and DiffSplice. In contrast to EBSeq and DiffSplice, SEVA was able to identify alternative splicing events independent of overall gene expression differences. Finally, additional validation was performed using RNA-seq data from HPV-positive oropharynx squamous cell carcinoma (OPSCC) tumors and normal samples from both TCGA and an independent cohort of 46 OPSCC tumors and 25 normal samples. In these tumor samples, SEVA finds cancer-specific ASEs in genes independently of their differential expression status. Moreover, SEVA finds on the order of hundreds of splice variant candidates, a number manageable for experimental validation, in contrast to the thousands of candidates found with EBSeq or DiffSplice. These candidates include experimentally validated splice variants in HNSCC from a previous microarray study. Based on performance in both simulated and real data, SEVA represents a robust algorithm that is well suited for differential ASE analysis, particularly in RNA-sequencing data from heterogeneous primary tumor samples.
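The idea of comparing within-group variability, rather than group means, can be illustrated with a small sketch. This is not the SEVA implementation: the abstract does not state the dissimilarity measure or the test statistic, so the sketch assumes that two junction expression profiles are compared by the fraction of junction pairs whose relative ordering differs, and that a group's variability is the mean of these pairwise dissimilarities. The function names and toy values are hypothetical, and no significance test is included.

```python
from itertools import combinations

def rank_discordance(a, b):
    """Fraction of feature pairs ordered differently in profiles a and b."""
    pairs = list(combinations(range(len(a)), 2))
    flips = sum((a[i] < a[j]) != (b[i] < b[j]) for i, j in pairs)
    return flips / len(pairs)

def within_group_variability(profiles):
    """Mean pairwise dissimilarity among all samples in one group."""
    dists = [rank_discordance(p, q) for p, q in combinations(profiles, 2)]
    return sum(dists) / len(dists)

# Toy data: 3 junctions per sample (illustrative values only).
normal = [[5, 3, 1], [6, 4, 2], [7, 3, 2]]   # same ordering in every sample
tumor = [[5, 3, 1], [1, 4, 6], [2, 9, 3]]    # orderings differ across samples

v_normal = within_group_variability(normal)
v_tumor = within_group_variability(tumor)
print(v_normal, v_tumor)
```

In this toy example the normal group has zero variability while the tumor group does not, even though no comparison of mean expression was made.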
Citation Format: Bahman Afsari, Theresa Guo, Michael Considine, Dylan Kelley, Emily Flam, Liliana Florea, Patrick Ha, Donald Geman, Michael F. Ochs, Joseph A. Califano, Daria A. Gaykalova, Alexander V. Favorov, Elana J. Fertig. Splice expression variation analysis (SEVA) for differential gene isoform usage in cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 3577. doi:10.1158/1538-7445.AM2017-3577
Affiliation(s)
| | | | | | | | - Emily Flam
- 1Johns Hopkins University, Baltimore, MD
| | | | - Patrick Ha
- 2University of California, San Francisco, CA
| | | | | | | | | | | | | |
|
12
|
Dinalankara W, Qe Q, Ji L, Xu Y, Pagane N, Lobo F, Younes L, Geman D, Marchionni L. Abstract 4551: Divergence analysis with coarse coding of omics data across cancer phenotypes. Cancer Res 2017. [DOI: 10.1158/1538-7445.am2017-4551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Motivation: Complex cancer omics data can be difficult to interpret and analyze with standard statistical methods. We therefore propose an innovative data representation that drastically reduces complexity while improving usability and interpretability for complex cancer phenotype analysis.
Method: Despite recent advances in omics technologies, the robustness of predictive biomarkers in cancer remains severely limited. We hypothesize that this is primarily due to an overemphasis on applying statistical learning methods without taking into consideration the underlying biological processes driving cancer. We therefore propose a new approach that represents data by comparison to a baseline group. This results in a data format that encodes biologically meaningful information and can be easily analyzed. We apply this transformation to publicly available datasets obtained across multiple tumor types using different omics technologies. For each cancer phenotype considered, we cross-validate the learned decision rules using SVMs and random forests and demonstrate that there is no drop in performance despite the use of a simplified data representation. We also apply the Chi-squared test to our simplified data to select genomic features differentially associated with relevant cancer phenotypes. To this end we compare our method to traditional class comparison approaches. Overall, this analysis shows that omics features selected by our method provide equal or better classification performance than standard methods. Further, we show that our simplified data representation filters out much of the biologically irrelevant variation and that the resulting data can be successfully applied to gene set analysis applications, ultimately improving inference on disease phenotypes. For instance, by applying our method to signaling pathways and cancer hallmarks gene sets, we show that our approach can be used to detect dysregulated pathways more efficiently than with traditional methods.
Conclusion: By comparing cancer omics data to a baseline status, we obtain a much simpler data representation that preserves biologically relevant information while eliminating much of the unwanted variance that is often confounding in the analysis of high-dimensional data. Furthermore, data represented using our approach can be easily stored and analyzed, and it is equivalent or superior to traditional data representation methods for predicting clinically relevant cancer phenotypes and detecting biologically relevant cancer pathways.
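A baseline-relative representation of this kind might look like the following sketch. The abstract does not specify the coding scheme, so this assumes a ternary code per feature (below, within, or above the range seen in the baseline group) with the baseline range taken as the observed minimum and maximum. Both choices, and all names, are illustrative assumptions rather than the authors' method.

```python
def baseline_intervals(baseline):
    """Per-feature (low, high) range observed in the baseline samples."""
    n_features = len(baseline[0])
    cols = [[s[j] for s in baseline] for j in range(n_features)]
    return [(min(c), max(c)) for c in cols]

def coarse_code(sample, intervals):
    """Code each feature as -1 (below), 0 (within) or +1 (above) baseline."""
    return [(-1 if x < lo else 1 if x > hi else 0)
            for x, (lo, hi) in zip(sample, intervals)]

# Toy data: 3 baseline samples and 1 tumor sample, 3 features each.
baseline = [[1.0, 10.0, 5.0], [2.0, 12.0, 6.0], [1.5, 11.0, 5.5]]
tumor_sample = [3.0, 11.5, 4.0]

intervals = baseline_intervals(baseline)
coded = coarse_code(tumor_sample, intervals)
print(coded)  # [1, 0, -1]
```

The coded vector keeps only the direction of departure from baseline, which is the sense in which such a representation is simpler than the raw measurements.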
Citation Format: Wikum Dinalankara, Qian Qe, Lanlan Ji, Yiran Xu, Nicole Pagane, Francisco Lobo, Laurent Younes, Donald Geman, Luigi Marchionni. Divergence analysis with coarse coding of omics data across cancer phenotypes [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 4551. doi:10.1158/1538-7445.AM2017-4551
Affiliation(s)
| | - Qian Qe
- 2Johns Hopkins University, Baltimore, MD
| | - Lanlan Ji
- 2Johns Hopkins University, Baltimore, MD
| | - Yiran Xu
- 2Johns Hopkins University, Baltimore, MD
| | | | - Francisco Lobo
- 3Federal University of Minas Gerais, Belo Horizonte, Brazil
| | | | | | | |
|
13
|
Ament SA, Pearl JR, Grindeland A, St. Claire J, Earls JC, Kovalenko M, Gillis T, Mysore J, Gusella JF, Lee JM, Kwak S, Howland D, Lee MY, Baxter D, Scherler K, Wang K, Geman D, Carroll JB, MacDonald ME, Carlson G, Wheeler VC, Price ND, Hood LE. High resolution time-course mapping of early transcriptomic, molecular and cellular phenotypes in Huntington's disease CAG knock-in mice across multiple genetic backgrounds. Hum Mol Genet 2017; 26:913-922. [PMID: 28334820 PMCID: PMC6075528 DOI: 10.1093/hmg/ddx006] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2016] [Revised: 12/09/2016] [Accepted: 01/03/2017] [Indexed: 01/11/2023] Open
Abstract
Huntington's disease is a dominantly inherited neurodegenerative disease caused by the expansion of a CAG repeat in the HTT gene. In addition to the length of the CAG expansion, factors such as genetic background have been shown to contribute to the age at onset of neurological symptoms. A central challenge in understanding the disease progression that leads from the HD mutation to massive cell death in the striatum is the ability to characterize the subtle and early functional consequences of the CAG expansion longitudinally. We used dense time course sampling between 4 and 20 postnatal weeks to characterize early transcriptomic, molecular and cellular phenotypes in the striatum of six distinct knock-in mouse models of the HD mutation. We studied the effects of the HttQ111 allele on the C57BL/6J, CD-1, FVB/NCr1, and 129S2/SvPasCrl genetic backgrounds, and of two additional alleles, HttQ92 and HttQ50, on the C57BL/6J background. We describe the emergence of a transcriptomic signature in HttQ111/+ mice involving hundreds of differentially expressed genes and changes in diverse molecular pathways. We also show that this time course spanned the onset of mutant huntingtin nuclear localization phenotypes and somatic CAG-length instability in the striatum. Genetic background strongly influenced the magnitude and age at onset of these effects. This work provides a foundation for understanding the earliest transcriptional and molecular changes contributing to HD pathogenesis.
Affiliation(s)
- Seth A. Ament
- Institute for Systems Biology, Seattle, WA, USA
- Institute for Genome Sciences and Department of Psychiatry, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Jocelynn R. Pearl
- Institute for Systems Biology, Seattle, WA, USA
- Molecular and Cellular Biology Graduate Program, University of Washington, Seattle, WA, USA
| | | | - Jason St. Claire
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - John C. Earls
- Institute for Systems Biology, Seattle, WA, USA
- Department of Computer Science, University of Washington, Seattle, WA, USA
| | - Marina Kovalenko
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - Tammy Gillis
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - Jayalakshmi Mysore
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - James F. Gusella
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - Jong-Min Lee
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - Seung Kwak
- CHDI Management/CHDI Foundation, Princeton, NJ, USA
| | | | | | | | | | - Kai Wang
- Institute for Systems Biology, Seattle, WA, USA
| | - Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
| | - Jeffrey B. Carroll
- Behavioral Neuroscience Program, Department of Psychology, Western Washington University, Bellingham, WA, USA
| | - Marcy E. MacDonald
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | | | - Vanessa C. Wheeler
- Center for Human Genetic Research, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA
| | | | | |
|
14
|
Geman D. Confluent Brownian motions. ADV APPL PROBAB 2016. [DOI: 10.2307/1426583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
15
|
|
16
|
Marchionni L, Geman D. Abstract 3754: Predicting cancer phenotypes with mechanism-driven multi-omics data integration. Cancer Res 2015. [DOI: 10.1158/1538-7445.am2015-3754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Over the past decade technological advances have enabled molecular profiling of human cancers across distinct genomic domains and other “omes”. The availability of such multi-omics datasets has in turn enabled the discovery of cancer subtypes characterized by distinct molecular patterns within and across different data modalities. Despite promising beginnings and the wealth of data, most efforts so far have focused on the discovery of new molecular taxonomies, enumerating novel cancer subtypes, and only subsequently projecting them into a biological context by leveraging knowledge on genetic and epigenetic variations, genomic alterations, gene expression patterns, and, in general, cell pathophysiology.
A paradigmatic approach to omics-based cancer classification usually entails (i) the discovery of novel molecular subtypes; (ii) the biological contextualization of such subtypes and their correlation with clinical phenotypes; and (iii) the development of predictors to detect these subtypes. Nevertheless, the direct clinical utility of such taxonomies is less evident. Some of the molecular subtypes, for instance, might not portend any different clinical behavior, or the underlying molecular pathways might not be actionable. Ultimately, existing biological knowledge enters the analysis only a posteriori to characterize and “label” the novel subtypes, rather than being leveraged a priori to guide the discovery process itself.
To overcome such nearly universal absence of mechanistic underpinnings for the omics-derived signatures and develop clinically useful biomarkers, we have proposed to develop mechanistic predictive models by incorporating gene network and signaling pathway information directly into the statistical learning process used to detect the cancer phenotypes. Unlike the paradigm described above, we used omics data and prior biological information to directly detect and predict the phenotypes. We now further extend this concept and leverage biological knowledge also to constrain multi-omics data integration, by implementing predictive rules that mechanistically aggregate measurements across distinct genomic modalities, reproducing the natural flow of biological information in the cell: from genome to phenotype, through epigenome, transcriptome and proteome.
To illustrate our approach and its impact on computational learning and cancer classification, we analyze clinically relevant cancer phenotypes using independent training and testing data. To this end we build our novel predictors using the Top Scoring Pair (TSP) algorithm, a two-gene parameter-free classifier, and its multi-pair extension kTSP. We then compare the classification performance of predictors derived from a single omics modality to those constructed by integrating multi-omics data according to mechanistic and biologically meaningful rules, revealing increased accuracy with the integrated classifiers.
Citation Format: Luigi Marchionni, Donald Geman. Predicting cancer phenotypes with mechanism-driven multi-omics data integration. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 3754. doi:10.1158/1538-7445.AM2015-3754
|
17
|
Abstract
Today, computer vision systems are tested by their accuracy in detecting and localizing instances of objects. As an alternative, and motivated by the ability of humans to provide far richer descriptions and even tell a story about an image, we construct a "visual Turing test": an operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous; the engine proposes the next question ("just-in-time truthing"). The test is then administered to the computer-vision system, one question at a time. After the system's answer is recorded, the system is provided the correct answer and the next question. Parsing is trivial and deterministic; the system being tested requires no natural language processing. The query engine employs statistical constraints, learned from a training set, to produce questions with essentially unpredictable answers: the answer to a question, given the history of questions and their correct answers, is nearly equally likely to be positive or negative. In this sense, the test is only about vision. The system is designed to produce streams of questions that follow natural story lines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.
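The "unpredictable answer" criterion can be sketched as follows. This is only an illustration under simplifying assumptions: the training set is taken to be a list of fully annotated images, the conditional probability of a "yes" is estimated by counting training images consistent with the history, and the engine deterministically picks the candidate closest to probability 0.5. The real query engine is stochastic and follows story lines; the function name and toy questions here are hypothetical.

```python
def next_question(training, history, candidates):
    """Pick the candidate whose answer is least predictable given history.

    training:   list of dicts mapping question -> True/False, one per image
    history:    dict of questions already asked -> their correct answers
    candidates: questions not yet asked
    """
    # Keep only training images that agree with every answer so far.
    consistent = [img for img in training
                  if all(img[q] == a for q, a in history.items())]
    if not consistent:
        return None  # no training support for this history

    def p_yes(q):
        return sum(img[q] for img in consistent) / len(consistent)

    # Closest to 0.5 means positive and negative are nearly equally likely.
    return min(candidates, key=lambda q: abs(p_yes(q) - 0.5))

training = [
    {"person?": True, "carrying bag?": True, "vehicle?": True},
    {"person?": True, "carrying bag?": False, "vehicle?": True},
    {"person?": True, "carrying bag?": True, "vehicle?": True},
    {"person?": True, "carrying bag?": False, "vehicle?": True},
    {"person?": False, "carrying bag?": False, "vehicle?": False},
]
q = next_question(training, {"person?": True}, ["carrying bag?", "vehicle?"])
print(q)  # carrying bag?
```

Given that a person is present, "vehicle?" is always true in this toy training set and so carries no information, while "carrying bag?" is a coin flip and is selected.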
Affiliation(s)
- Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21287
| | - Stuart Geman
- Division of Applied Mathematics, Brown University, Providence, RI 02912
| | - Neil Hallonquist
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21287
| | - Laurent Younes
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21287
| |
|
18
|
Geman D, Ochs M, Price ND, Tomasetti C, Younes L. An argument for mechanism-based statistical inference in cancer. Hum Genet 2014; 134:479-95. [PMID: 25381197 DOI: 10.1007/s00439-014-1501-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2014] [Accepted: 10/14/2014] [Indexed: 01/07/2023]
Abstract
Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda—in particular, predicting disease phenotypes, progression and treatment response for individuals—requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.
Affiliation(s)
- Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21210, USA
| | | | | | | | | |
|
19
|
Afsari B, Geman D, Fertig EJ. Learning dysregulated pathways in cancers from differential variability analysis. Cancer Inform 2014; 13:61-7. [PMID: 25392694 PMCID: PMC4218688 DOI: 10.4137/cin.s14066] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2014] [Revised: 08/13/2014] [Accepted: 08/14/2014] [Indexed: 12/16/2022] Open
Abstract
Analysis of gene sets can implicate activity in signaling pathways that is responsible for cancer initiation and progression, but is not discernible from the analysis of individual genes. Multiple methods and software packages have been developed to infer pathway activity from expression measurements for the set of genes targeted by that pathway. Broadly, three major methodologies have been proposed: over-representation, enrichment, and differential variability. Both over-representation and enrichment analyses are effective techniques to infer differentially regulated pathways from gene sets with relatively consistent differentially expressed (DE) genes. Specifically, these algorithms aggregate statistics from each gene in the pathway. However, they overlook multivariate patterns related to gene interactions and variations in expression. Therefore, the analysis of differential variability of multigene expression patterns can be essential to pathway inference in cancers. The corresponding methodologies and software packages for such multivariate variability analysis of pathways are reviewed here. We also introduce a new, computationally efficient algorithm, expression variation analysis (EVA), which has been implemented along with a previously proposed algorithm, Differential Rank Conservation (DIRAC), in an open source R package, gene set regulation (GSReg). EVA inferred pathways similar to those inferred by DIRAC at reduced computational cost. Moreover, EVA also inferred dysregulated pathways different from those identified by enrichment analysis.
Affiliation(s)
- Bahman Afsari
- Postdoctoral Fellow, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
| | - Donald Geman
- Professor, Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
| | - Elana J Fertig
- Assistant Professor, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
| |
|
20
|
Ma S, Sung J, Magis AT, Wang Y, Geman D, Price ND. Measuring the effect of inter-study variability on estimating prediction error. PLoS One 2014; 9:e110840. [PMID: 25330348 PMCID: PMC4201588 DOI: 10.1371/journal.pone.0110840] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2014] [Accepted: 09/18/2014] [Indexed: 11/19/2022] Open
Abstract
Background The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in “batch-effects”) and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies. Methods Here we quantify the impact of these combined “study-effects” on a disease signature’s predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of number of studies quantifies influence of study-effects on performance. Results As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification. 
Conclusions We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when “sufficient” diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
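The difference between the two validation schemes comes down to how test folds are formed, which the following sketch illustrates. It assumes each sample carries a study label; the function names, the number of folds, and the toy labels are hypothetical and are not taken from the paper.

```python
import random

def rcv_folds(n_samples, k, seed=0):
    """Randomized CV: shuffle sample indices and deal them into k test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]

def isv_folds(study_of):
    """Inter-study validation: each test fold is one entire study."""
    folds = {}
    for i, study in enumerate(study_of):
        folds.setdefault(study, []).append(i)
    return folds

# Hypothetical study labels for 8 samples drawn from 3 studies.
study_of = ["A", "A", "A", "B", "B", "C", "C", "C"]

for test in rcv_folds(len(study_of), k=3):
    print("RCV test fold:", test)
for study, test in isv_folds(study_of).items():
    print("ISV held-out study", study, "->", test)
```

In RCV a test fold usually mixes samples from several studies, so training and test data share study-specific effects; in ISV the held-out study contributes nothing to training, which is why its performance estimates tend to be lower.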
Affiliation(s)
- Shuyi Ma
- Institute for Systems Biology, Seattle, Washington, United States of America
- Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America
| | - Jaeyun Sung
- Institute for Systems Biology, Seattle, Washington, United States of America
- Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk, Republic of Korea
| | - Andrew T. Magis
- Institute for Systems Biology, Seattle, Washington, United States of America
- Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United States of America
| | - Yuliang Wang
- Institute for Systems Biology, Seattle, Washington, United States of America
- Sage Bionetworks, Seattle, Washington, United States of America
| | - Donald Geman
- Institute for Computational Medicine & Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Nathan D. Price
- Institute for Systems Biology, Seattle, Washington, United States of America
- Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America
- Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United States of America
| |
|
21
|
Afsari B, Fertig EJ, Younes L, Geman D, Marchionni L. Abstract 5342: Hardwiring mechanism into predicting cancer phenotypes by computational learning. Cancer Res 2014. [DOI: 10.1158/1538-7445.am2014-5342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Rationale. Despite promising beginnings, molecular classifiers derived from statistical learning do not yet appear to be sufficiently mature for clinical use. Besides known limitations, the nearly universal absence of mechanistic underpinnings for such signatures represents a major barrier toward successful implementation of clinically useful biomarkers. To overcome this limitation, we constrained the search for predictive models to those with mechanistic justification, by incorporating microRNA (miR) and transcription factor (TF) gene regulatory networks directly into the learning process of cancer phenotypes.
Methods. To illustrate the impact of embedding such regulatory motifs into computational learning, we analyzed the ability to predict estrogen receptor (ER) status from transcriptional data. We applied this approach to two independent breast cancer studies used as training and validation sets respectively. This analysis provided a test case with well-characterized clinical attributes, in which the ER itself is a TF engaged in regulatory miR/TF motifs. We built our predictors using Top Scoring Pair (TSP), a two-gene parameter-free classifier returning one class (ER positive) or the other (ER negative) based on the relative ordering of the two genes. We compared classification performance between TSPs chosen from all possible gene pairs and TSPs constructed under network-based constraints - “random” and “mechanistic” TSPs respectively hereafter. Each “mechanistic” TSP consists of a gene pair: the first gene regulates a miR or a TF “hub”, which in turn regulates the second gene. We started from a network of 200 TFs, 373 miRs, and 2772 target genes based on regulatory information from the miRgen v2.0 and TarBase v5.0 databases.
Results. We assessed the classification accuracy of the TSP classifiers derived from the training dataset in the validation set, and nearly all top-performing predictors were based on regulatory motifs. A Wilcoxon rank-sum test comparing the “random” classifiers with either TF- or miR-based TSPs had P-values of 10⁻¹⁴ and 10⁻²⁶, respectively. Most of these top “mechanistic” predictors involved the ER gene (ESR1), consistent with the underlying biology. The mechanistic predictor also paired ESR1 expression with genes relevant to the biology. For instance, TSP selected POU2F1 (a TF member of the POU family, also known as OCT1), which physically interacts with the ER itself and BRCA1, recruiting BRCA1 to the ESR1 promoter and modulating ER expression. Consistent with the classifier, BRCA1-mutant breast tumors are typically ER negative.
Conclusions. We have implemented a novel class of mechanistic predictors by “hardwiring” gene regulatory network information into statistical learning of cancer phenotypes. This approach has intrinsic added value for knowledge discovery and disease treatment design, and will ultimately move the field towards a successful transition to personalized health care.
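The TSP rule described in the Methods above, and the effect of restricting the pair search, can be sketched as follows. The pair score used here (the difference between classes in how often gene i is expressed below gene j) is the commonly cited TSP criterion and is an assumption with respect to this abstract, which does not spell it out. The function names, the `candidate_pairs` argument, and the toy data are hypothetical; in the study the allowed pairs would come from the miR/TF regulatory network.

```python
from itertools import combinations

def tsp_score(X, y, i, j):
    """Signed score: P(x_i < x_j | class 1) - P(x_i < x_j | class 0)."""
    def freq(label):
        rows = [x for x, c in zip(X, y) if c == label]
        return sum(x[i] < x[j] for x in rows) / len(rows)
    return freq(1) - freq(0)

def fit_tsp(X, y, candidate_pairs=None):
    """Return (i, j, sign) for the pair with the largest absolute score.

    candidate_pairs=None searches all gene pairs; passing a restricted list
    mimics network-constrained ("mechanistic") pair selection.
    """
    if candidate_pairs is None:
        candidate_pairs = combinations(range(len(X[0])), 2)
    i, j = max(candidate_pairs, key=lambda p: abs(tsp_score(X, y, *p)))
    sign = 1 if tsp_score(X, y, i, j) >= 0 else -1
    return i, j, sign

def predict_tsp(model, x):
    """Predict class 1 when the ordering matches the class-1 pattern."""
    i, j, sign = model
    ordered = x[i] < x[j]
    return int(ordered) if sign > 0 else int(not ordered)

# Toy expression matrix: 4 samples x 3 genes, labels 1 (e.g. ER+) and 0 (ER-).
X = [[1, 5, 3], [2, 6, 9], [7, 3, 4], [8, 2, 1]]
y = [1, 1, 0, 0]

unconstrained = fit_tsp(X, y)
constrained = fit_tsp(X, y, candidate_pairs=[(1, 2), (0, 2)])
print(unconstrained, constrained)
print(predict_tsp(unconstrained, [1, 9, 0]))
```

The classifier has no tunable parameters beyond the chosen pair: prediction depends only on which of the two genes is more highly expressed in the new sample.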
Citation Format: Bahman Afsari, Elana Judith Fertig, Laurent Younes, Donald Geman, Luigi Marchionni. Hardwiring mechanism into predicting cancer phenotypes by computational learning. [abstract]. In: Proceedings of the 105th Annual Meeting of the American Association for Cancer Research; 2014 Apr 5-9; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2014;74(19 Suppl):Abstract nr 5342. doi:10.1158/1538-7445.AM2014-5342
|
22
|
Abstract
k-Top Scoring Pairs (kTSP) is a classification method for prediction from high-throughput data based on a set of paired measurements. Each of the two possible orderings of a pair of measurements (e.g. a reversal in the expression of two genes) is associated with one of two classes. The kTSP prediction rule is the aggregation of voting among such individual two-feature decision rules based on order switching. kTSP, like its predecessor, Top Scoring Pair (TSP), is a parameter-free classifier relying only on ranking of a small subset of features, rendering it robust to noise and potentially easy to interpret in biological terms. In contrast to TSP, kTSP has comparable accuracy to standard genomics classification techniques, including Support Vector Machines and Prediction Analysis for Microarrays. Here, we describe 'switchBox', an R package for kTSP-based prediction. AVAILABILITY The 'switchBox' package is freely available from Bioconductor: http://www.bioconductor.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
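The voting rule can be written in a few lines. The sketch below is a plain Python restatement of the idea and does not use or reproduce the 'switchBox' API. It assumes the k pairs have already been selected and are encoded so that x[i] < x[j] is a vote for class 1; the encoding, the function name, and the toy values are hypothetical.

```python
def ktsp_predict(pairs, x):
    """Majority vote over pairwise order rules.

    pairs: list of (i, j) such that x[i] < x[j] votes for class 1 and the
           reverse ordering votes for class 0 (hypothetical encoding).
    """
    votes_for_1 = sum(x[i] < x[j] for i, j in pairs)
    # With an odd number of pairs there are no ties.
    return int(votes_for_1 > len(pairs) / 2)

pairs = [(0, 1), (2, 3), (4, 5)]        # k = 3 illustrative pairs
sample = [1.2, 3.4, 5.0, 2.0, 0.1, 0.9]
print(ktsp_predict(pairs, sample))       # 1
```

Because only the within-sample ordering of each pair matters, the prediction is unchanged by any monotone transformation of the sample's measurements, which is one reason rank-based rules of this kind are described as robust to noise.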
Affiliation(s)
- Bahman Afsari
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Elana J Fertig
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Donald Geman
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Luigi Marchionni
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| |
|
23
|
|
24
|
Simcha DM, Younes L, Aryee MJ, Geman D. Identification of direction in gene networks from expression and methylation. BMC Syst Biol 2013; 7:118. [PMID: 24182195 PMCID: PMC4228359 DOI: 10.1186/1752-0509-7-118] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/17/2012] [Accepted: 10/17/2013] [Indexed: 01/27/2023]
Abstract
BACKGROUND Reverse-engineering gene regulatory networks from expression data is difficult, especially without temporal measurements or interventional experiments. In particular, the causal direction of an edge is generally not statistically identifiable, i.e., cannot be inferred as a statistical parameter, even from an unlimited amount of non-time series observational mRNA expression data. Some additional evidence is required, and high-throughput methylation data can be viewed as a natural multifactorial gene perturbation experiment. RESULTS We introduce IDEM (Identifying Direction from Expression and Methylation), a method for identifying the causal direction of edges by combining DNA methylation and mRNA transcription data. We describe the circumstances under which edge directions become identifiable, and experiments with both real and synthetic data demonstrate that the accuracy of IDEM for inferring both edge placement and edge direction in gene regulatory networks is significantly improved relative to other methods. CONCLUSION Reverse-engineering directed gene regulatory networks from static observational data becomes feasible by exploiting the context provided by high-throughput DNA methylation data. An implementation of the algorithm described is available at http://code.google.com/p/idem/.
Affiliation(s)
- David M Simcha
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.
|
25
|
Sung J, Kim PJ, Ma S, Funk CC, Magis AT, Wang Y, Hood L, Geman D, Price ND. Multi-study integration of brain cancer transcriptomes reveals organ-level molecular signatures. PLoS Comput Biol 2013; 9:e1003148. [PMID: 23935471 PMCID: PMC3723500 DOI: 10.1371/journal.pcbi.1003148] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 11/29/2012] [Accepted: 06/05/2013] [Indexed: 12/23/2022] Open
Abstract
We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These signatures were based on a new method reported herein – Identification of Structured Signatures and Classifiers (ISSAC) – that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies, for it was observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, as is typical. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, 90% phenotype prediction accuracy, or the accuracy of identifying a particular brain cancer from the background of all phenotypes, was found. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood. 
From a multi-study, integrated transcriptomic dataset, we identified a marker panel for differentiating major human brain cancers at the gene-expression level. The ISSAC molecular signatures for brain cancers, composed of 44 unique genes, are based on comparing expression levels of pairs of genes, and phenotype prediction follows a diagnostic hierarchy. We found that sufficient dataset integration across multiple studies greatly enhanced diagnostic performance on truly independent validation sets, whereas signatures learned from only one dataset typically led to high error rate. Molecular signatures of brain cancers, when obtained using all currently available gene-expression data, achieved 90% phenotype prediction accuracy. Thus, our integrative approach holds significant promise for developing organ-level, comprehensive, molecular signatures of disease.
Affiliation(s)
- Jaeyun Sung
- Institute for Systems Biology, Seattle, Washington, United States of America
- Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America

- Pan-Jun Kim
- Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk, Republic of Korea
- Department of Physics, POSTECH, Pohang, Gyeongbuk, Republic of Korea

- Shuyi Ma
- Institute for Systems Biology, Seattle, Washington, United States of America
- Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America

- Cory C. Funk
- Institute for Systems Biology, Seattle, Washington, United States of America

- Andrew T. Magis
- Institute for Systems Biology, Seattle, Washington, United States of America
- Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United States of America

- Yuliang Wang
- Institute for Systems Biology, Seattle, Washington, United States of America
- Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America

- Leroy Hood
- Institute for Systems Biology, Seattle, Washington, United States of America

- Donald Geman
- Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America

- Nathan D. Price
- Institute for Systems Biology, Seattle, Washington, United States of America
|
26
|
Abstract
Background A small number of prognostic and predictive tests based on gene expression are currently offered as reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been identified in other genomic-based predictors and the success rate for developing clinically useful genomic signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting of computational research. As a result, a need has emerged for a template for reproducible development of genomic signatures that incorporates full transparency, data sharing and statistical robustness. Results Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease free survival. Conclusions Our study provides a prototypic example for reproducible development of computational algorithms for learning prognostic biomarkers in the era of personalized medicine.
Affiliation(s)
- Luigi Marchionni
- The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, School of Medicine, Baltimore, MD 21231, USA
|
27
|
Abstract
Because of the inherent complexity of coupled nonlinear biological systems, the development of computational models is necessary for achieving a quantitative understanding of their structure and function in health and disease. Statistical learning is applied to high-dimensional biomolecular data to create models that describe relationships between molecules and networks. Multiscale modeling links networks to cells, organs, and organ systems. Computational approaches are used to characterize anatomic shape and its variations in health and disease. In each case, the purposes of modeling are to capture all that we know about disease and to develop improved therapies tailored to the needs of individuals. We discuss advances in computational medicine, with specific examples in the fields of cancer, diabetes, cardiology, and neurology. Advances in translating these computational methods to the clinic are described, as well as challenges in applying models for improving patient health.
Affiliation(s)
- Raimond L Winslow
- The Institute for Computational Medicine, Center for Cardiovascular Bioinformatics and Modeling, and Department of Biomedical Engineering, The Johns Hopkins University School of Medicine, Baltimore, MD 21218, USA.
|
28
|
Sánchez-Vega F, Younes L, Geman D. Learning multivariate distributions by competitive assembly of marginals. IEEE Trans Pattern Anal Mach Intell 2013; 35:398-410. [PMID: 22529323 DOI: 10.1109/tpami.2012.96] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or "primitives," which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a Lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules which restrict the merging of them into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.
Affiliation(s)
- Francisco Sánchez-Vega
- Department of Applied Mathematics and Statistics, Center for Imaging Science and Institute for Computational Medicine, Johns Hopkins University, Clark Hall, 3400 N. Charles St., Baltimore, MD 21218, USA.
|
29
|
Abstract
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify “motifs” that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA “background” sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are “too null,” resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where “ground truth” is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced “over-fitting” in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
Affiliation(s)
- David Simcha
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
|
30
|
Abstract
Protein signaling networks play a central role in transcriptional regulation and the etiology of many diseases. Statistical methods, particularly Bayesian networks, have been widely used to model cell signaling, mostly for model organisms and with focus on uncovering connectivity rather than inferring aberrations. Extensions to mammalian systems have not yielded compelling results, due likely to greatly increased complexity and limited proteomic measurements in vivo. In this study, we propose a comprehensive statistical model that is anchored to a predefined core topology, has a limited complexity due to parameter sharing and uses microarray data of mRNA transcripts as the only observable components of signaling. Specifically, we account for cell heterogeneity and a multilevel process, representing signaling as a Bayesian network at the cell level, modeling measurements as ensemble averages at the tissue level, and incorporating patient-to-patient differences at the population level. Motivated by the goal of identifying individual protein abnormalities as potential therapeutical targets, we applied our method to the RAS-RAF network using a breast cancer study with 118 patients. We demonstrated rigorous statistical inference, established reproducibility through simulations and the ability to recover receptor status from available microarray data.
Affiliation(s)
- Erdem Yörük
- Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, USA.
|
31
|
Abstract
Histone modifications are fundamental to chromatin structure and transcriptional regulation, and are recognized by a limited number of protein folds. Among these folds are PHD fingers, which are present in most chromatin modification complexes. To date, about 15 PHD finger domains have been structurally characterized, whereas hundreds of different sequences have been identified. Consequently, an important open problem is to predict structural features of a PHD finger knowing only its sequence. Here, we classify PHD fingers into different groups based on the analysis of residue–residue co-evolution in their sequences. We measure the degree to which fixing the amino acid type at one position modifies the frequencies of amino acids at other positions. We then detect those position/amino acid combinations, or ‘conditions’, which have the strongest impact on other sequence positions. Clustering these strong conditions yields four families, providing informative labels for PHD finger sequences. Existing experimental results, as well as docking calculations performed here, reveal that these families indeed show discrepancies at the functional level. Our method should facilitate the functional characterization of new PHD fingers, as well as other protein families, solely based on sequence information.
Affiliation(s)
- Patrick Slama
- Institute for Computational Medicine and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA.
|
32
|
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010; 11:733-9. [PMID: 20838408 DOI: 10.1038/nrg2825] [Citation(s) in RCA: 1253] [Impact Index Per Article: 89.5] [Indexed: 11/09/2022]
Abstract
High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.
Affiliation(s)
- Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205-2179, USA
|
33
|
Abstract
The enormous amount of biomolecule measurement data generated from high-throughput technologies has brought an increased need for computational tools in biological analyses. Such tools can enhance our understanding of human health and genetic diseases, such as cancer, by accurately classifying phenotypes, detecting the presence of disease, discriminating among cancer sub-types, predicting clinical outcomes, and characterizing disease progression. In the case of gene expression microarray data, standard statistical learning methods have been used to identify classifiers that can accurately distinguish disease phenotypes. However, these mathematical prediction rules are often highly complex, and they lack the convenience and simplicity desired for extracting underlying biological meaning or transitioning into the clinic. In this review, we survey a powerful collection of computational methods for analyzing transcriptomic microarray data that address these limitations. Relative Expression Analysis (RXA) is based only on the relative orderings among the expressions of a small number of genes. Specifically, we provide a description of the first and simplest example of RXA, the K-TSP classifier, which is based on K pairs of genes; the case K = 1 is the TSP classifier. Given their simplicity and ease of biological interpretation, as well as their invariance to data normalization and parameter-fitting, these classifiers have been widely applied in aiding molecular diagnostics in a broad range of human cancers. We review several studies which demonstrate accurate classification of disease phenotypes (e.g., cancer vs. normal), cancer subclasses (e.g., AML vs. ALL, GIST vs. LMS), disease outcomes (e.g., metastasis, survival), and diverse human pathologies assayed through blood-borne leukocytes.
The studies presented demonstrate that RXA, specifically the TSP and K-TSP classifiers, is a promising new class of computational methods for analyzing high-throughput data, and has the potential to significantly contribute to molecular cancer diagnosis and prognosis.
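The K-pair decision rule described in this abstract can be illustrated with a minimal sketch. It assumes unweighted majority voting over the K pairs and classes coded 0 and 1; the function name `ktsp_predict` and the tuple layout of `pairs` are hypothetical choices made for illustration, not taken from the reviewed papers.

```python
def ktsp_predict(profile, pairs):
    """Classify one expression profile by majority vote over K gene pairs.

    profile: sequence of expression values, indexed by gene.
    pairs:   list of (i, j, class_if_less) tuples (hypothetical layout); a pair
             votes for class_if_less when profile[i] < profile[j], and for the
             other class otherwise.
    Classes are coded 0 and 1. K should be odd so that the vote cannot tie.
    """
    votes_for_1 = 0
    for i, j, class_if_less in pairs:
        vote = class_if_less if profile[i] < profile[j] else 1 - class_if_less
        votes_for_1 += vote
    # Strict majority of the K votes decides the class.
    return 1 if 2 * votes_for_1 > len(pairs) else 0


# Three pairs (K = 3) over six genes; with K = 1 this reduces to a single comparison.
example_pairs = [(0, 1, 1), (2, 3, 1), (4, 5, 0)]
print(ktsp_predict([1, 2, 3, 5, 9, 1], example_pairs))
```

Because only within-sample orderings are compared, the prediction is unchanged by any monotone rescaling of a profile, which is the invariance to normalization that the abstract mentions.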
Affiliation(s)
- James A Eddy
- Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, USA
|
34
|
Eddy JA, Hood L, Price ND, Geman D. Identifying tightly regulated and variably expressed networks by Differential Rank Conservation (DIRAC). PLoS Comput Biol 2010; 6:e1000792. [PMID: 20523739 PMCID: PMC2877722 DOI: 10.1371/journal.pcbi.1000792] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Received: 10/30/2009] [Accepted: 04/22/2010] [Indexed: 12/18/2022] Open
Abstract
A powerful way to separate signal from noise in biology is to convert the molecular data from individual genes or proteins into an analysis of comparative biological network behaviors. One of the limitations of previous network analyses is that they do not take into account the combinatorial nature of gene interactions within the network. We report here a new technique, Differential Rank Conservation (DIRAC), which permits one to assess these combinatorial interactions to quantify various biological pathways or networks in a comparative sense, and to determine how they change in different individuals experiencing the same disease process. This approach is based on the relative expression values of participating genes—i.e., the ordering of expression within network profiles. DIRAC provides quantitative measures of how network rankings differ either among networks for a selected phenotype or among phenotypes for a selected network. We examined disease phenotypes including cancer subtypes and neurological disorders and identified networks that are tightly regulated, as defined by high conservation of transcript ordering. Interestingly, we observed a strong trend to looser network regulation in more malignant phenotypes and later stages of disease. At a sample level, DIRAC can detect a change in ranking between phenotypes for any selected network. Variably expressed networks represent statistically robust differences between disease states and serve as signatures for accurate molecular classification, validating the information about expression patterns captured by DIRAC. Importantly, DIRAC can be applied not only to transcriptomic data, but to any ordinal data type. 
The systems approach to medicine derives from the idea that diseased cells arise from one or more perturbed biological networks due to the net effect of interactions among multiple molecular agents; by measuring differences in the abundance of biomolecules (e.g., mRNA, proteins, metabolites) we can identify reporters of network states and uncover molecular signatures of disease. However, a major limitation of previously published network analyses is the focus on small numbers of individual, differentially-expressed genes, hence the failure to take into account combinatorial interactions. We report a new technique, Differential Rank Conservation, for identifying and measuring network-level perturbations. Our rank conservation index is based entirely on the relative levels of expression for participating genes and allows us to detect differences in network orderings between networks for a given phenotype and between phenotypes for a given network. In examining cancer subtypes and neurological disorders, we identified networks that are tightly and loosely regulated, as defined by the level of conservation of transcript ordering, and observed a strong trend to looser network regulation in more malignant phenotypes and later stages of disease. We also demonstrate that variably expressed networks represent robust differences between disease states.
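One way to read the rank conservation idea is sketched below. The sketch assumes that a network's "template" for a phenotype is the majority ordering of each gene pair, and that conservation is the average agreement of samples with that template. The function names are hypothetical and the details may differ from the published DIRAC definitions.

```python
from itertools import combinations


def rank_template(profiles):
    """Majority ordering of each gene pair across the samples of one phenotype."""
    n_genes = len(profiles[0])
    template = {}
    for i, j in combinations(range(n_genes), 2):
        below = sum(p[i] < p[j] for p in profiles)
        # True means gene i is expressed below gene j in most samples.
        template[(i, j)] = 2 * below > len(profiles)
    return template


def matching_score(profile, template):
    """Fraction of gene pairs whose ordering in this sample agrees with the template."""
    agree = sum((profile[i] < profile[j]) == expected
                for (i, j), expected in template.items())
    return agree / len(template)


def conservation_index(profiles):
    """Mean matching score of a phenotype's samples against their own template."""
    template = rank_template(profiles)
    return sum(matching_score(p, template) for p in profiles) / len(profiles)


# A network whose gene ordering is identical in every sample is fully conserved.
tight = [[1, 2, 3], [2, 5, 9], [0, 1, 4]]
loose = [[1, 2, 3], [3, 2, 1], [2, 3, 1], [1, 3, 2]]
print(conservation_index(tight), conservation_index(loose))
```

Under this reading, a "tightly regulated" network scores near 1, and a lower score corresponds to the looser regulation the abstract reports for more malignant phenotypes.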
Affiliation(s)
- James A. Eddy
- Institute for Genomic Biology, University of Illinois, Urbana, Illinois, United States of America
- Department of Bioengineering, University of Illinois, Urbana, Illinois, United States of America

- Leroy Hood
- Institute for Systems Biology, Seattle, Washington, United States of America

- Nathan D. Price
- Institute for Genomic Biology, University of Illinois, Urbana, Illinois, United States of America
- Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United States of America
- Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America

- Donald Geman
- Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
|
35
|
|
36
|
Abstract
The computational identification from global data sets of stable and predictive patterns of gene and protein relative expression reversals offers a simple, yet powerful approach to target therapies for personalized medicine and to identify pathways that are disease-perturbed. We previously utilized this approach to identify a molecular classifier with near 100% accuracy for differentiating gastrointestinal stromal tumor (GIST) and leiomyosarcoma (LMS), two cancers that have very similar histopathology, but require very different treatments. Differential Rank Conservation (DIRAC) is a novel approach for studying gene ordering within pathways and is based on the relative expression ranks of participating genes. DIRAC provides quantitative measures of how pathway rankings differ both within and between phenotypes. DIRAC between pathways in a selected phenotype contrasts the scenarios where either (i) pathways are ranked similarly in all samples; or (ii) the ordering of pathway genes is highly varied. We examined gene expression in GIST and LMS tumor profiles and identified pathways that appear to be tightly regulated based on high conservation of gene ordering. The second form of DIRAC manifests as a change in ranking (i.e., shuffling) between phenotypes for a selected pathway. These variably expressed pathways serve as signatures for molecular classification, and the ability to accurately classify microarray samples provided strong validation for the pathway-level expression differences identified by DIRAC.
Affiliation(s)
- James A Eddy
- Department of Bioengineering, University of Illinois, Urbana, IL 61801 USA.
|
37
|
Edelman LB, Toia G, Geman D, Zhang W, Price ND. Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases. BMC Genomics 2009; 10:583. [PMID: 19961616 PMCID: PMC2797819 DOI: 10.1186/1471-2164-10-583] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 05/13/2009] [Accepted: 12/05/2009] [Indexed: 11/15/2022] Open
Abstract
Background Identification of molecular classifiers from genome-wide gene expression analysis is an important practice for the investigation of biological systems in the post-genomic era - and one with great potential for near-term clinical impact. The 'Top-Scoring Pair' (TSP) classification method identifies pairs of genes whose relative expression correlates strongly with phenotype. In this study, we sought to assess the effectiveness of the TSP approach in the identification of diagnostic classifiers for a number of human diseases including bacterial and viral infection, cardiomyopathy, diabetes, Crohn's disease, and transformed ulcerative colitis. We examined transcriptional profiles from both solid tissues and blood-borne leukocytes. Results The algorithm identified multiple predictive gene pairs for each phenotype, with cross-validation accuracy ranging from 70 to nearly 100 percent, and high sensitivity and specificity observed in most classification tasks. Performance compared favourably with that of pre-existing transcription-based classifiers, and in some cases was comparable to the accuracy of current clinical diagnostic procedures. Several diseases of solid tissues could be reliably diagnosed through classifiers based on the blood-borne leukocyte transcriptome. The TSP classifier thus represents a simple yet robust method to differentiate between diverse phenotypic states based on gene expression profiles. Conclusion Two-transcript classifiers have the potential to reliably classify diverse human diseases, through analysis of both local diseased tissue and the immunological response assayed through blood-borne leukocytes. The experimental simplicity of this method results in measurements that can be easily translated to clinical practice.
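How a top-scoring pair might be selected can be sketched as follows. The sketch assumes the pair score is the absolute difference, between the two classes, in the frequency with which one gene is expressed below the other. The function name `top_scoring_pair` and the data layout are hypothetical, and the exhaustive search shown here is only practical for small gene sets.

```python
from itertools import combinations


def top_scoring_pair(samples, labels):
    """Return (i, j, score) for the gene pair whose ordering best separates two classes.

    samples: list of expression profiles (one list of numbers per sample).
    labels:  list of 0/1 class labels, one per sample; both classes must be present.
    score:   |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, estimated by frequencies.
    """
    n_genes = len(samples[0])
    groups = {0: [], 1: []}
    for profile, label in zip(samples, labels):
        groups[label].append(profile)

    def freq(profiles, i, j):
        # Fraction of profiles in which gene i is expressed below gene j.
        return sum(p[i] < p[j] for p in profiles) / len(profiles)

    best = (0, 1, -1.0)
    for i, j in combinations(range(n_genes), 2):
        score = abs(freq(groups[0], i, j) - freq(groups[1], i, j))
        if score > best[2]:
            best = (i, j, score)
    return best


# Toy data: genes 0 and 1 swap order between classes, gene 2 is uninformative.
toy_samples = [[1, 2, 5], [1, 3, 0], [2, 1, 5], [3, 1, 0]]
toy_labels = [0, 0, 1, 1]
print(top_scoring_pair(toy_samples, toy_labels))
```

The resulting classifier needs only two transcript measurements per sample, which is the experimental simplicity the abstract points to.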
Affiliation(s)
- Lucas B Edelman
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
|
38
|
Lin X, Afsari B, Marchionni L, Cope L, Parmigiani G, Naiman D, Geman D. The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations. BMC Bioinformatics 2009; 10:256. [PMID: 19695104 PMCID: PMC2745389 DOI: 10.1186/1471-2105-10-256] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 05/12/2009] [Accepted: 08/20/2009] [Indexed: 11/11/2022] Open
Abstract
Background A major challenge in computational biology is to extract knowledge about the genetic nature of disease from high-throughput data. However, an important obstacle to both biological understanding and clinical applications is the "black box" nature of the decision rules provided by most machine learning approaches, which usually involve many genes combined in a highly complex fashion. Achieving biologically relevant results argues for a different strategy. A promising alternative is to base prediction entirely upon the relative expression ordering of a small number of genes. Results We present a three-gene version of "relative expression analysis" (RXA), a rigorous and systematic comparison with earlier approaches in a variety of cancer studies, a clinically relevant application to predicting germline BRCA1 mutations in breast cancer and a cross-study validation for predicting ER status. In the BRCA1 study, RXA yields high accuracy with a simple decision rule: in tumors carrying mutations, the expression of a "reference gene" falls between the expression of two differentially expressed genes, PPP1CB and RNF14. An analysis of the protein-protein interactions among the triplet of genes and BRCA1 suggests that the classifier has a biological foundation. Conclusion RXA has the potential to identify genomic "marker interactions" with plausible biological interpretation and direct clinical applicability. It provides a general framework for understanding the roles of the genes involved in decision rules, as illustrated for the difficult and clinically relevant problem of identifying BRCA1 mutation carriers.
Affiliation(s)
- Xue Lin
- Department of Applied Mathematics and Statistics, The Johns Hopkins University, Baltimore, Maryland, USA.
|
39
|
Abstract
Starting from a member of an image database designated the "query image," traditional image retrieval techniques, for example, search by visual similarity, allow one to locate additional instances of a target category residing in the database. However, in many cases, the query image or, more generally, the target category, resides only in the mind of the user as a set of subjective visual patterns, psychological impressions, or "mental pictures." Consequently, since image databases available today are often unstructured and lack reliable semantic annotations, it is often not obvious how to initiate a search session; this is the "page zero problem." We propose a new statistical framework based on relevance feedback to locate an instance of a semantic category in an unstructured image database with no semantic annotation. A search session is initiated from a random sample of images. At each retrieval round, the user is asked to select one image from among a set of displayed images: the one that is closest, in the user's opinion, to the target class. The matching is then "mental." Performance is measured by the number of iterations necessary to display an image which satisfies the user, at which point standard techniques can be employed to display other instances. Our core contribution is a Bayesian formulation which scales to large databases. The two key components are a response model which accounts for the user's subjective perception of similarity and a display algorithm which seeks to maximize the flow of information. Experiments with real users and two databases of 20,000 and 60,000 images demonstrate the efficiency of the search process.
Affiliation(s)
- Marin Ferecatu
- TSI Department, Institut Telecom, Telecom Paristech, 46, rue Barrault, 75634 Paris, France.
|
40
|
Wang JZ, Geman D, Luo J, Gray RM. Real-world image annotation and retrieval: an introduction to the special section. IEEE Trans Pattern Anal Mach Intell 2008; 30:1873-1876. [PMID: 19791313 DOI: 10.1109/tpami.2008.231] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/28/2023]
Affiliation(s)
- James Z Wang
- College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16803, USA.
|
41
|
Xu L, Geman D, Winslow RL. Large-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics 2007; 8:275. [PMID: 17663766 PMCID: PMC1950528 DOI: 10.1186/1471-2105-8-275] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.4] [Received: 02/15/2007] [Accepted: 07/30/2007] [Indexed: 11/15/2022]
Abstract
Background: There is a continuing need to develop molecular diagnostic tools which complement histopathologic examination to increase the accuracy of cancer diagnosis. DNA microarrays provide a means for measuring gene expression signatures which can then be used as components of genomic-based diagnostic tests to determine the presence of cancer. Results: In this study, we collect and integrate ~1500 microarray gene expression profiles from 26 published cancer data sets across 21 major human cancer types. We then apply a statistical method, referred to as the Top-Scoring Pair of Groups (TSPG) classifier, and a repeated random sampling strategy to the integrated training data sets and identify a common cancer signature consisting of 46 genes. These 46 genes are naturally divided into two distinct groups; those in one group are typically expressed less than those in the other group for cancer tissues. Given a new expression profile, the classifier discriminates cancer from normal tissues by ranking the expression values of the 46 genes in the cancer signature and comparing the average ranks of the two groups. This signature is then validated by applying this decision rule to independent test data. Conclusion: By combining the TSPG method and repeated random sampling, a robust common cancer signature has been identified from large-scale microarray data integration. Upon further validation, this signature may be useful as a robust and objective diagnostic test for cancer.
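The decision rule in this abstract (rank the signature genes within one profile, then compare the average ranks of the two groups) can be sketched as follows. This is an illustrative sketch only, not the published TSPG code: the function names, the dictionary input, and the simple tie handling (ties are ranked in input order rather than averaged) are assumptions.

```python
import numpy as np


def rank_values(values):
    """Return 1-based ranks (smallest value gets rank 1); ties keep input order."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def classify_by_group_ranks(profile, low_in_cancer, high_in_cancer):
    # profile: hypothetical mapping from gene name to expression value.
    # low_in_cancer / high_in_cancer: the two gene groups of the signature.
    genes = list(low_in_cancer) + list(high_in_cancer)
    ranks = rank_values(np.array([profile[g] for g in genes], dtype=float))
    n_low = len(low_in_cancer)
    # Call the sample cancer when the "low" group ranks below the "high" group on average.
    return "cancer" if ranks[:n_low].mean() < ranks[n_low:].mean() else "normal"
```

Because only within-sample ranks are used, the rule does not depend on the absolute scale of a given array platform, which is consistent with the cross-study integration the abstract describes.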
Affiliation(s)
- Lei Xu
- The Institute for Computational Medicine and Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD 21218, USA
- Donald Geman
- The Institute for Computational Medicine and Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Applied Mathematics and Statistics and Center for Imaging Sciences, Johns Hopkins University, Baltimore, MD 21218, USA
- Raimond L Winslow
- The Institute for Computational Medicine and Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD 21218, USA
|
42
|
Anderson TJ, Tchernyshyov I, Diez R, Cole RN, Geman D, Dang CV, Winslow RL. Discovering robust protein biomarkers for disease from relative expression reversals in 2-D DIGE data. Proteomics 2007; 7:1197-207. [PMID: 17366473 DOI: 10.1002/pmic.200600374] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
This study assesses the ability of a novel family of machine learning algorithms to identify changes in relative protein expression levels, measured using 2-D DIGE data, which support accurate class prediction. The analysis was done using a training set of 36 total cellular lysates comprised of six normal and three cancer biological replicates (the remaining are technical replicates) and a validation set of four normal and two cancer samples. Protein samples were separated by 2-D DIGE and expression was quantified using DeCyder-2D Differential Analysis Software. The relative expression reversal (RER) classifier correctly classified 9/9 training biological samples (p<0.022) as estimated using a modified version of leave one out cross validation and 6/6 validation samples. The classification rule involved comparison of expression levels for a single pair of protein spots, tropomyosin isoforms and alpha-enolase, both of which have prior association as potential biomarkers in cancer. The data was also analyzed using algorithms similar to those found in the extended data analysis package of DeCyder software. We propose that by accounting for sources of within- and between-gel variation, RER classifiers applied to 2-D DIGE data provide a useful approach for identifying biomarkers that discriminate among protein samples of interest.
Affiliation(s)
- Troy J Anderson
- Center for Cardiovascular Bioinformatics and Modeling and The Institute of Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, USA.
|
43
|
Abstract
MOTIVATION: DNA microarray data analysis has been used previously to identify marker genes which discriminate cancer from normal samples. However, due to the limited sample size of each study, there are few common markers among different studies of the same cancer. With the rapid accumulation of microarray data, it is of great interest to integrate inter-study microarray data to increase sample size, which could lead to the discovery of more reliable markers. RESULTS: We present a novel, simple method of integrating different microarray datasets to identify marker genes and apply the method to prostate cancer datasets. In this study, by applying a new statistical method, referred to as the top-scoring pair (TSP) classifier, we have identified a pair of robust marker genes (HPN and STAT6) by integrating microarray datasets from three different prostate cancer studies. Cross-platform validation shows that the TSP classifier built from the marker gene pair, which simply compares relative expression values, achieves high accuracy, sensitivity and specificity on independent datasets generated using various array platforms. Our findings suggest a new model for the discovery of marker genes from accumulated microarray data and demonstrate how the great wealth of microarray data can be exploited to increase the power of statistical analysis. CONTACT: leixu@jhu.edu.
Affiliation(s)
- Lei Xu
- The Whitaker Biomedical Engineering Institute, The Johns Hopkins University, Baltimore, MD 21218, USA.
|
44
|
Abstract
MOTIVATION: Various studies have shown that cancer tissue samples can be successfully detected and classified by their gene expression patterns using machine learning approaches. One of the challenges in applying these techniques for classifying gene expression data is to extract accurate, readily interpretable rules providing biological insight as to how classification is performed. Current methods generate classifiers that are accurate but difficult to interpret. This is the trade-off between credibility and comprehensibility of the classifiers. Here, we introduce a new classifier in order to address these problems. It is referred to as k-TSP (k-Top Scoring Pairs) and is based on the concept of 'relative expression reversals'. This method generates simple and accurate decision rules that only involve a small number of gene-to-gene expression comparisons, thereby facilitating follow-up studies. RESULTS: In this study, we have compared our approach to other machine learning techniques for class prediction in 19 binary and multi-class gene expression datasets involving human cancers. The k-TSP classifier performs as efficiently as Prediction Analysis of Microarray and support vector machine, and outperforms other learning methods (decision trees, k-nearest neighbour and naïve Bayes). Our approach is easy to interpret as the classifier involves only a small number of informative genes. For these reasons, we consider the k-TSP method to be a useful tool for cancer classification from microarray gene expression data. AVAILABILITY: The software and datasets are available at http://www.ccbm.jhu.edu CONTACT: actan@jhu.edu.
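A decision rule built from a small number of gene-to-gene comparisons can be sketched as a vote over pairs. The sketch below is one plausible reading for the binary case, not the released k-TSP software: the unweighted majority vote, the tuple layout `(i, j, class_if_less)`, and the assumption that k is odd (so no tie can occur) are all illustrative choices.

```python
from collections import Counter


def predict_ktsp(pairs, x):
    """Unweighted majority vote over k gene pairs for one expression profile x.

    pairs: hypothetical list of (i, j, class_if_less) tuples, where
    class_if_less (0 or 1) is the class voted for when x[i] < x[j].
    Assumes k is odd; with an even k, a tie falls back to class 0.
    """
    votes = Counter()
    for i, j, class_if_less in pairs:
        vote = class_if_less if x[i] < x[j] else 1 - class_if_less
        votes[vote] += 1
    return 1 if votes[1] > votes[0] else 0
```

Each vote is a single readable comparison between two genes, which is what makes the resulting rule easy to inspect.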
Affiliation(s)
- Aik Choon Tan
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute, Baltimore, MD 21218, USA.
|
45
|
|
46
|
|
47
|
Abstract
Multiclass shape detection, in the sense of recognizing and localizing instances from multiple shape classes, is formulated as a two-step process in which local indexing primes global interpretation. During indexing a list of instantiations (shape identities and poses) is compiled, constrained only by no missed detections at the expense of false positives. Global information, such as expected relationships among poses, is incorporated afterward to remove ambiguities. This division is motivated by computational efficiency. In addition, indexing itself is organized as a coarse-to-fine search simultaneously in class and pose. This search can be interpreted as successive approximations to likelihood ratio tests arising from a simple ("naive Bayes") statistical model for the edge maps extracted from the original images. The key to constructing efficient "hypothesis tests" for multiple classes and poses is local ORing; in particular, spread edges provide imprecise but common and locally invariant features. Natural tradeoffs then emerge between discrimination and the pattern of spreading. These are analyzed mathematically within the model-based framework and the whole procedure is illustrated by experiments in reading license plates.
Affiliation(s)
- Yali Amit
- Department of Statistics, University of Chicago, Chicago, IL 60637, USA.
|
48
|
Abstract
We present a new approach to molecular classification based on mRNA comparisons. Our method, referred to as the top-scoring pair(s) (TSP) classifier, is motivated by current technical and practical limitations in using gene expression microarray data for class prediction, for example to detect disease, identify tumors or predict treatment response. Accurate statistical inference from such data is difficult due to the small number of observations, typically tens, relative to the large number of genes, typically thousands. Moreover, conventional methods from machine learning lead to decisions which are usually very difficult to interpret in simple or biologically meaningful terms. In contrast, the TSP classifier provides decision rules which i) involve very few genes and only relative expression values (e.g., comparing the mRNA counts within a single pair of genes); ii) are both accurate and transparent; and iii) provide specific hypotheses for follow-up studies. In particular, the TSP classifier achieves prediction rates with standard cancer data that are as high as those of previous studies which use considerably more genes and complex procedures. Finally, the TSP classifier is parameter-free, thus avoiding the type of over-fitting and inflated estimates of performance that result when all aspects of learning a predictor are not properly cross-validated.
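The pairwise-comparison idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' procedure: it assumes two classes labeled 0 and 1, scores each gene pair by the difference between classes in the frequency of the ordering "gene i below gene j", and resolves tied scores by taking the first maximum. The function names and the model tuple are hypothetical.

```python
import numpy as np


def fit_tsp(X, y):
    """Pick the gene pair whose ordering frequency differs most between classes.

    X: (samples, genes) expression matrix; y: labels in {0, 1}.
    Returns (i, j, class_if_less), where class_if_less is the class
    predicted when gene i is expressed below gene j.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # less[s, i, j] is True when gene i is below gene j in sample s
    less = X[:, :, None] < X[:, None, :]
    p0 = less[y == 0].mean(axis=0)  # ordering frequency in class 0
    p1 = less[y == 1].mean(axis=0)  # ordering frequency in class 1
    score = np.abs(p0 - p1)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    class_if_less = 0 if p0[i, j] > p1[i, j] else 1
    return int(i), int(j), class_if_less


def predict_tsp(model, x):
    i, j, class_if_less = model
    return class_if_less if x[i] < x[j] else 1 - class_if_less
```

The sketch builds a samples-by-genes-by-genes boolean array, so it is only practical for a small number of genes; with thousands of genes the pairs would need to be scanned in blocks.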
Affiliation(s)
- Donald Geman
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Applied Mathematics and Statistics, Johns Hopkins University,
- Christian d'Avignon
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Biomedical Engineering, Johns Hopkins University,
- Daniel Q. Naiman
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Applied Mathematics and Statistics, Johns Hopkins University,
- Raimond L. Winslow
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute, and Department of Biomedical Engineering, Johns Hopkins University,
|
49
|
|
50
|
Abstract
We propose a computational model for detecting and localizing instances from an object class in static gray-level images. We divide detection into visual selection and final classification, concentrating on the former: drastically reducing the number of candidate regions that require further, usually more intensive, processing, but with a minimum of computation and missed detections. Bottom-up processing is based on local groupings of edge fragments constrained by loose geometrical relationships. They have no a priori semantic or geometric interpretation. The role of training is to select special groupings that are moderately likely at certain places on the object but rare in the background. We show that the statistics in both populations are stable. The candidate regions are those that contain global arrangements of several local groupings. Whereas our model was not conceived to explain brain functions, it does cohere with evidence about the functions of neurons in V1 and V2, such as responses to coarse or incomplete patterns (e.g., illusory contours) and to scale and translation invariance in IT. Finally, the algorithm is applied to face and symbol detection.
Affiliation(s)
- Y Amit
- Department of Statistics, University of Chicago, Chicago, IL 60637, USA
|