1
|
Rosenberger G, Li W, Turunen M, He J, Subramaniam PS, Pampou S, Griffin AT, Karan C, Kerwin P, Murray D, Honig B, Liu Y, Califano A. Network-based elucidation of colon cancer drug resistance mechanisms by phosphoproteomic time-series analysis. Nat Commun 2024; 15:3909. [PMID: 38724493 PMCID: PMC11082183 DOI: 10.1038/s41467-024-47957-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Accepted: 04/16/2024] [Indexed: 05/12/2024]
Abstract
Aberrant signaling pathway activity is a hallmark of tumorigenesis and progression, which has guided targeted inhibitor design for over 30 years. Yet, adaptive resistance mechanisms, induced by rapid, context-specific signaling network rewiring, continue to challenge therapeutic efficacy. Leveraging progress in proteomic technologies and network-based methodologies, we introduce Virtual Enrichment-based Signaling Protein-activity Analysis (VESPA), an algorithm designed to elucidate mechanisms of cell response and adaptation to drug perturbations, and use it to analyze 7-point phosphoproteomic time series from colorectal cancer cells treated with clinically relevant inhibitors and control media. Interrogating tumor-specific enzyme/substrate interactions accurately infers kinase and phosphatase activity, based on their substrate phosphorylation state, effectively accounting for signal crosstalk and sparse phosphoproteome coverage. The analysis elucidates time-dependent signaling pathway response to each drug perturbation and, more importantly, cell adaptive response and rewiring, experimentally confirmed by CRISPR knock-out assays, suggesting broad applicability to cancer and other diseases.
Affiliation(s)
- George Rosenberger
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Wenxue Li
  - Yale Cancer Biology Institute, Yale University, West Haven, CT, USA
- Mikko Turunen
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Jing He
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - Regeneron Genetics Center, Tarrytown, NY, USA
- Prem S Subramaniam
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Sergey Pampou
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - J.P. Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY, USA
- Aaron T Griffin
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - Medical Scientist Training Program, Columbia University Irving Medical Center, New York, NY, USA
- Charles Karan
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - J.P. Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY, USA
- Patrick Kerwin
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Diana Murray
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Barry Honig
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Biochemistry & Molecular Biophysics, Columbia University Irving Medical Center, New York, NY, USA
  - Zuckerman Mind Brain and Behavior Institute, Columbia University, New York, NY, USA
  - Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
- Yansheng Liu
  - Yale Cancer Biology Institute, Yale University, West Haven, CT, USA
  - Department of Pharmacology, Yale University School of Medicine, New Haven, CT, USA
- Andrea Califano
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Biochemistry & Molecular Biophysics, Columbia University Irving Medical Center, New York, NY, USA
  - Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
  - Chan Zuckerberg Biohub New York, New York, NY, USA
|
2
|
Rosenberger G, Li W, Turunen M, He J, Subramaniam PS, Pampou S, Griffin AT, Karan C, Kerwin P, Murray D, Honig B, Liu Y, Califano A. Network-based elucidation of colon cancer drug resistance by phosphoproteomic time-series analysis. bioRxiv 2023:2023.02.15.528736. [PMID: 36824919 PMCID: PMC9949144 DOI: 10.1101/2023.02.15.528736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/18/2023]
Abstract
Aberrant signaling pathway activity is a hallmark of tumorigenesis and progression, which has guided targeted inhibitor design for over 30 years. Yet, adaptive resistance mechanisms, induced by rapid, context-specific signaling network rewiring, continue to challenge therapeutic efficacy. By leveraging progress in proteomic technologies and network-based methodologies over the past decade, we developed VESPA, an algorithm designed to elucidate mechanisms of cell response and adaptation to drug perturbations, and used it to analyze 7-point phosphoproteomic time series from colorectal cancer cells treated with clinically relevant inhibitors and control media. Interrogation of tumor-specific enzyme/substrate interactions accurately inferred kinase and phosphatase activity, based on their inferred substrate phosphorylation state, effectively accounting for signal cross-talk and sparse phosphoproteome coverage. The analysis elucidated time-dependent signaling pathway response to each drug perturbation and, more importantly, cell adaptive response and rewiring that was experimentally confirmed by CRISPRko assays, suggesting broad applicability to cancer and other diseases.
Affiliation(s)
- George Rosenberger
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Wenxue Li
  - Yale Cancer Biology Institute, Yale University, West Haven, CT, USA
- Mikko Turunen
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Jing He
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - Present address: Regeneron Genetics Center, Tarrytown, NY, USA
- Prem S Subramaniam
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Sergey Pampou
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - J.P. Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY, USA
- Aaron T Griffin
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - Medical Scientist Training Program, Columbia University Irving Medical Center, New York, NY, USA
- Charles Karan
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - J.P. Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY, USA
- Patrick Kerwin
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Diana Murray
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Barry Honig
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Yansheng Liu
  - Yale Cancer Biology Institute, Yale University, West Haven, CT, USA
  - Department of Pharmacology, Yale University School of Medicine, New Haven, CT, USA
- Andrea Califano
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
  - J.P. Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
  - Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Biochemistry & Molecular Biophysics, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
|
3
|
Taylor S, Ponzini M, Wilson M, Kim K. Comparison of imputation and imputation-free methods for statistical analysis of mass spectrometry data with missing data. Brief Bioinform 2021; 23:6361033. [PMID: 34472591 DOI: 10.1093/bib/bbab353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/05/2021] [Revised: 07/27/2021] [Accepted: 08/10/2021] [Indexed: 11/14/2022]
Abstract
Missing values are common in high-throughput mass spectrometry data. Two strategies are available to address missing values: (i) eliminate or impute the missing values and apply statistical methods that require complete data and (ii) use statistical methods that specifically account for missing values without imputation (imputation-free methods). This study reviews the effect of sample size and percentage of missing values on statistical inference for multiple methods under these two strategies. With increasing missingness, the ability of imputation and imputation-free methods to identify differentially and non-differentially regulated compounds in a two-group comparison study declined. Random forest and k-nearest neighbor imputation combined with a Wilcoxon test performed well in statistical testing for up to 50% missingness with little bias in estimating the effect size. Quantile regression imputation accompanied with a Wilcoxon test also had good statistical testing outcomes but substantially distorted the difference in means between groups. None of the imputation-free methods performed consistently better for statistical testing than imputation methods.
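As a rough illustration of the imputation step named in the abstract above, the following Python sketch fills each missing value with the mean of that column over the k most similar rows. It is a minimal stand-in for k-nearest neighbor imputation in general, not the procedure evaluated in the study; the function name `knn_impute`, the use of `None` for missing values, and the mean-squared-difference distance are all assumptions made for this example. A rank-based test such as the Wilcoxon test would then be run on the completed matrix.

```python
def knn_impute(matrix, k=2):
    """Fill None entries using the k nearest rows (hypothetical helper).

    Distance between two rows is the mean squared difference over the
    columns observed in both rows. Rows that lack the target column or
    share no observed column with the incomplete row are skipped.
    """
    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = sum((a - b) ** 2 for a, b in shared) / len(shared)
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap if no usable neighbor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Rows are compounds, columns are samples; None marks a missing intensity.
example = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
completed = knn_impute(example, k=2)
```

The sketch imputes from the original matrix only, so an imputed value never feeds into another imputation; production implementations differ in distance weighting and in how they treat rows with many gaps.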
Affiliation(s)
- Sandra Taylor
  - Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
- Matthew Ponzini
  - Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
- Machelle Wilson
  - Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
- Kyoungmi Kim
  - Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
|
4
|
Li Q, Fisher K, Meng W, Fang B, Welsh E, Haura EB, Koomen JM, Eschrich SA, Fridley BL, Chen YA. GMSimpute: a generalized two-step Lasso approach to impute missing values in label-free mass spectrum analysis. Bioinformatics 2020; 36:257-263. [PMID: 31199438 PMCID: PMC6956786 DOI: 10.1093/bioinformatics/btz488] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/14/2018] [Revised: 05/06/2019] [Accepted: 06/10/2019] [Indexed: 12/16/2022]
Abstract
Motivation: Missingness in label-free mass spectrometry is inherent to the technology. A computational approach to recover missing values in metabolomics and proteomics datasets is important. Most existing methods are designed under a particular assumption, either missing at random or under the detection limit. If the missing pattern deviates from the assumption, it may lead to biased results. Hence, we investigate the missing patterns in label-free mass spectrometry data and develop an omnibus approach, GMSimpute, to allow effective imputation accommodating different missing patterns. Results: Three proteomics datasets and one metabolomics dataset indicate missing values could be a mixture of abundance-dependent and abundance-independent missingness. We assess the performance of GMSimpute using simulated data (with a wide range of 80 missing patterns) and metabolomics data from the Cancer Genome Atlas breast cancer and clear cell renal cell carcinoma studies. Using Pearson correlation and normalized root mean square errors between the true and imputed abundance, we compare its performance to K-nearest-neighbors-type approaches, Random Forest, GSimp, a model-based method implemented in DanteR and minimum values. The results indicate GMSimpute provides higher accuracy in imputation and exhibits stable performance across different missing patterns. In addition, GMSimpute is able to identify the features in downstream differential expression analysis with high accuracy when applied to the Cancer Genome Atlas datasets. Availability and implementation: GMSimpute is on CRAN: https://cran.r-project.org/web/packages/GMSimpute/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.
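The two evaluation metrics named in this abstract, Pearson correlation and normalized root mean square error (NRMSE) between true and imputed abundances, can be sketched in a few lines of Python. The abstract does not state which normalization the authors use; the version below divides the mean squared error by the population variance of the true values, which is one common convention and an assumption here. The function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def nrmse(true_vals, imputed_vals):
    """RMSE normalized by the standard deviation of the true values.

    Assumption: normalization by the population variance of the truth;
    other papers normalize by range or mean instead.
    """
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)


# Toy values for the entries that were masked and then imputed.
true_vals = [10.0, 12.0, 14.0, 16.0]
imputed_vals = [11.0, 12.0, 13.0, 16.0]
print(pearson(true_vals, imputed_vals), nrmse(true_vals, imputed_vals))
```

Lower NRMSE and higher correlation both indicate imputed values closer to the held-out truth; the two can disagree when an imputation method preserves ordering but shifts or shrinks the values.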
Affiliation(s)
- Qian Li
  - Health Informatics Institute, University of South Florida, Tampa, FL, USA
- Kate Fisher
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
  - Department of Biostatistics, IDDI Inc., Raleigh, NC, USA
- Wenjun Meng
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Bin Fang
  - Proteomics and Metabolomics Core Facility, Moffitt Cancer Center, Tampa, FL, USA
- Eric Welsh
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Eric B Haura
  - Department of Thoracic Oncology, Moffitt Cancer Center, Tampa, FL, USA
- John M Koomen
  - Department of Molecular Oncology, Moffitt Cancer Center, Tampa, FL, USA
- Steven A Eschrich
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Brooke L Fridley
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Y Ann Chen
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
|
5
|
Niedzwiecki MM, Walker DI, Howell JC, Watts KD, Jones DP, Miller GW, Hu WT. High-resolution metabolomic profiling of Alzheimer's disease in plasma. Ann Clin Transl Neurol 2019; 7:36-45. [PMID: 31828981 PMCID: PMC6952314 DOI: 10.1002/acn3.50956] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 07/31/2019] [Revised: 11/08/2019] [Accepted: 11/09/2019] [Indexed: 12/13/2022]
Abstract
Background: Alzheimer’s disease (AD) is a complex neurological disorder with contributions from genetic and environmental factors. High‐resolution metabolomics (HRM) has the potential to identify novel endogenous and environmental factors involved in AD. Previous metabolomics studies have identified circulating metabolites linked to AD, but lack of replication and inconsistent diagnostic algorithms have hindered the generalizability of these findings. Here we applied HRM to identify plasma metabolic and environmental factors associated with AD in two study samples, with cerebrospinal fluid (CSF) biomarkers of AD incorporated to achieve high diagnostic accuracy. Methods: Liquid chromatography‐mass spectrometry (LC–MS)‐based HRM was used to identify plasma and CSF metabolites associated with AD diagnosis and CSF AD biomarkers in two studies of prevalent AD (Study 1: 43 AD cases, 45 mild cognitive impairment [MCI] cases, 41 controls; Study 2: 50 AD cases, 18 controls). AD‐associated metabolites were identified using a metabolome‐wide association study (MWAS) framework. Results: An MWAS meta‐analysis identified three non‐medication AD‐associated metabolites in plasma, including elevated levels of glutamine and an unknown halogenated compound and lower levels of piperine, a dietary alkaloid. The non‐medication metabolites were correlated with CSF AD biomarkers, and glutamine and the unknown halogenated compound were also detected in CSF. Furthermore, in Study 1, the unknown compound and piperine were altered in MCI patients in the same direction as AD dementia. Conclusions: In plasma, AD was reproducibly associated with elevated levels of glutamine and a halogen‐containing compound and reduced levels of piperine. These findings provide further evidence that exposures and behavior may modify AD risks.
Affiliation(s)
- Megan M Niedzwiecki
  - Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, Georgia
  - Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, New York
- Douglas I Walker
  - Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, Georgia
  - Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, New York
  - Clinical Biomarkers Laboratory, Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, Emory University, Atlanta, Georgia
- Kelly D Watts
  - Department of Neurology, Emory University, Atlanta, Georgia
- Dean P Jones
  - Clinical Biomarkers Laboratory, Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, Emory University, Atlanta, Georgia
- Gary W Miller
  - Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, Georgia
  - Department of Neurology, Emory University, Atlanta, Georgia
  - Center for Neurodegenerative Diseases, Emory University, Atlanta, Georgia
  - Department of Pharmacology, Emory University, Atlanta, Georgia
- William T Hu
  - Department of Neurology, Emory University, Atlanta, Georgia
  - Center for Neurodegenerative Diseases, Emory University, Atlanta, Georgia
  - Alzheimer's Disease Research Center, Emory University, Atlanta, Georgia
|
6
|
O'Brien JJ, Gunawardena HP, Paulo JA, Chen X, Ibrahim JG, Gygi SP, Qaqish BF. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann Appl Stat 2018; 12:2075-2095. [PMID: 30473739 PMCID: PMC6249692 DOI: 10.1214/18-aoas1144] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Indexed: 12/31/2022]
Abstract
An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides requiring an inferential step to obtain protein level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
Affiliation(s)
- Jonathon J O'Brien
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Harsha P Gunawardena
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Joao A Paulo
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Xian Chen
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Joseph G Ibrahim
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Steven P Gygi
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Bahjat F Qaqish
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
|
7
|
Statistical characterization of therapeutic protein modifications. Sci Rep 2017; 7:7896. [PMID: 28801661 PMCID: PMC5554216 DOI: 10.1038/s41598-017-08333-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 03/02/2017] [Accepted: 07/07/2017] [Indexed: 12/25/2022]
Abstract
Peptide mapping with liquid chromatography–tandem mass spectrometry (LC-MS/MS) is an important analytical method for characterization of post-translational and chemical modifications in therapeutic proteins. Despite its importance, there is currently no consensus on the statistical analysis of the resulting data. In this manuscript, we distinguish three statistical goals for therapeutic protein characterization: (1) estimation of site occupancy of modifications in one condition, (2) detection of differential site occupancy between conditions, and (3) estimation of combined site occupancy across multiple modification sites. We propose an approach, which addresses these goals in terms of summarizing the quantitative information from the mass spectra, statistical modeling, and model-based analysis of LC-MS/MS data. We illustrate the approach using an LC-MS/MS experiment from an antibody-drug conjugate and its monoclonal antibody intermediate. The performance was compared to a ‘naïve’ data analysis approach, by using computer simulation, evaluation of differential site occupancy in positive and negative controls, and comparisons of estimated site occupancy with orthogonal experimental measurements of N-linked glycoforms and total oxidation. The results demonstrated the importance of replicated studies of protein characterization, and of appropriate statistical modeling, for reproducible, accurate and efficient site occupancy estimation and differential analysis.
|
8
|
Myint L, Kleensang A, Zhao L, Hartung T, Hansen KD. Joint Bounding of Peaks Across Samples Improves Differential Analysis in Mass Spectrometry-Based Metabolomics. Anal Chem 2017; 89:3517-3523. [PMID: 28221771 PMCID: PMC5362739 DOI: 10.1021/acs.analchem.6b04719] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 11/28/2016] [Accepted: 02/21/2017] [Indexed: 12/20/2022]
Abstract
As mass spectrometry-based metabolomics becomes more widely used in biomedical research, it is important to revisit existing data analysis paradigms. Existing data preprocessing efforts have largely focused on methods which start by extracting features separately from each sample, followed by a subsequent attempt to group features across samples to facilitate comparisons. We show that this preprocessing approach leads to unnecessary variability in peak quantifications that adversely impacts downstream analysis. We present a new method, bakedpi, for the preprocessing of both centroid and profile mode metabolomics data that relies on an intensity-weighted bivariate kernel density estimation on a pooling of all samples to detect peaks. This new method reduces this unnecessary quantification variability and increases power in downstream differential analysis.
Affiliation(s)
- Leslie Myint
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, United States
- Andre Kleensang
  - Center for Alternatives to Animal Testing (CAAT), Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, United States
- Liang Zhao
  - Center for Alternatives to Animal Testing (CAAT), Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, United States
- Thomas Hartung
  - Center for Alternatives to Animal Testing (CAAT), Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, United States
  - University of Konstanz, CAAT-Europe, 78457 Konstanz, Germany
- Kasper D. Hansen
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, United States
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
|
9
|
Kakourou A, Vach W, Mertens B. Adapting censored regression methods to adjust for the limit of detection in the calibration of diagnostic rules for clinical mass spectrometry proteomic data. Stat Methods Med Res 2016; 27:2742-2755. [DOI: 10.1177/0962280216685742] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
In this paper, we consider the problem of calibrating diagnostic rules based on high-resolution mass spectrometry data subject to the limit of detection. The limit of detection is related to the limitation of instruments in measuring low-concentration proteins. As a consequence, peak intensities below the limit of detection are often reported as missing during the quantification step of proteomic analysis. We propose the use of censored data methodology to handle spectral measurements in the presence of a limit of detection, recognizing that those have been left-censored for low-abundance proteins. We replace the set of incomplete spectral measurements with estimates of the expected intensity and use those as input to a prediction model. To correct for lack of information and measurement uncertainty, we combine this approach with borrowing of information through the addition of an individual-specific random effect formulation. We present different modalities of using the above formulation for prediction purposes and show how it may also allow for variable selection. We evaluate the proposed methods by comparing their predictive performance with the one achieved using the complete information as well as alternative methods to deal with the limit of detection.
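To make the idea of replacing left-censored measurements with an expected intensity concrete, the Python sketch below computes the mean of a normal distribution truncated above at the limit of detection, E[X | X < LOD] = mu - sigma * phi(a) / Phi(a) with a = (LOD - mu) / sigma. This is the standard truncated-normal result, not the authors' model: it assumes Gaussian (for example log-scale) intensities with known `mu` and `sigma`, whereas the paper estimates such quantities through censored regression with a random effect. All function names are illustrative.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_below_lod(mu, sigma, lod):
    """E[X | X < lod] for X ~ Normal(mu, sigma^2).

    Assumes mu and sigma are already known; in practice they would be
    estimated from the censored data.
    """
    a = (lod - mu) / sigma
    return mu - sigma * normal_pdf(a) / normal_cdf(a)


def fill_censored(values, mu, sigma, lod):
    """Replace None (censored) entries with the conditional expectation."""
    fill = expected_below_lod(mu, sigma, lod)
    return [fill if v is None else v for v in values]


# Log-intensities for one peak; None marks values below the detection limit.
peak = [5.2, None, 6.1, None, 5.8]
print(fill_censored(peak, mu=5.0, sigma=1.0, lod=4.0))
```

The filled value always lies below the limit of detection, unlike substitution by the observed mean, which would place censored measurements among the detected ones.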
Affiliation(s)
- Alexia Kakourou
  - Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
- Werner Vach
  - Center for Medical Biometry and Medical Informatics, University of Freiburg, Freiburg, Germany
- Bart Mertens
  - Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
|
10
|
Taylor SL, Ruhaak LR, Weiss RH, Kelly K, Kim K. Multivariate two-part statistics for analysis of correlated mass spectrometry data from multiple biological specimens. Bioinformatics 2016; 33:17-25. [PMID: 27592710 DOI: 10.1093/bioinformatics/btw578] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/19/2016] [Revised: 08/30/2016] [Accepted: 08/31/2016] [Indexed: 11/14/2022]
Abstract
Motivation: High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data and imputation can impact between-biospecimen correlation and multivariate analysis results. Results: We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. Availability and implementation: We provide R functions to implement and illustrate our method as supplementary information. Contact: sltaylor@ucdavis.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
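For orientation, a univariate two-part statistic of the general kind this abstract builds on can be sketched in Python as follows. It adds the squared z-statistic for the difference in the proportion of observed values to the squared two-sample t-statistic on the observed values, and obtains a p-value by permuting group labels. This is only a single-biospecimen building block under stated assumptions (Welch-form t-statistic, `None` for missing values, hypothetical function names); it is not the authors' multivariate statistics or their R functions.

```python
import math
import random


def two_part_statistic(group_a, group_b):
    """Squared proportion z-statistic plus squared Welch t-statistic.

    Either part contributes 0 when it cannot be computed (no variation
    in missingness, or fewer than two observed values in a group).
    """
    obs_a = [v for v in group_a if v is not None]
    obs_b = [v for v in group_b if v is not None]
    n_a, n_b = len(group_a), len(group_b)
    pooled_p = (len(obs_a) + len(obs_b)) / (n_a + n_b)
    b = 0.0
    if 0.0 < pooled_p < 1.0:
        diff = len(obs_a) / n_a - len(obs_b) / n_b
        b = diff / math.sqrt(pooled_p * (1.0 - pooled_p) * (1.0 / n_a + 1.0 / n_b))
    t = 0.0
    if len(obs_a) >= 2 and len(obs_b) >= 2:
        mean_a = sum(obs_a) / len(obs_a)
        mean_b = sum(obs_b) / len(obs_b)
        var_a = sum((v - mean_a) ** 2 for v in obs_a) / (len(obs_a) - 1)
        var_b = sum((v - mean_b) ** 2 for v in obs_b) / (len(obs_b) - 1)
        se = math.sqrt(var_a / len(obs_a) + var_b / len(obs_b))
        if se > 0.0:
            t = (mean_a - mean_b) / se
    return b * b + t * t


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Permutation p-value for the two-part statistic (label shuffling)."""
    rng = random.Random(seed)
    observed = two_part_statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = two_part_statistic(pooled[:len(group_a)], pooled[len(group_a):])
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


cases = [5.0, 6.0, 7.0, 8.0]
controls = [None, None, 1.0, 2.0]
print(two_part_statistic(cases, controls), permutation_p_value(cases, controls))
```

Because missing values enter through the proportion part rather than being filled in, no imputation is needed; a multivariate extension would combine such statistics across biospecimens and permute subjects jointly to preserve between-biospecimen correlation.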
Affiliation(s)
- Sandra L Taylor
- Division of Biostatistics, Department of Public Health Sciences, University of California Davis, CA, 95616, USA
- L Renee Ruhaak
- Department of Clinical Chemistry and Laboratory Medicine, Leiden University Medical Center, Leiden, The Netherlands
- Karen Kelly
- Division of Hematology and Oncology, Department of Internal Medicine School of Medicine, University of California, Davis, CA 95616, USA
- Kyoungmi Kim
- Division of Biostatistics, Department of Public Health Sciences, University of California Davis, CA, 95616, USA
|
11
|
Winderbaum LJ, Koch I, Gustafsson OJR, Meding S, Hoffmann P. Feature extraction for proteomics imaging mass spectrometry data. Ann Appl Stat 2015. [DOI: 10.1214/15-aoas870] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
|
12
|
Webb-Robertson BJM, Wiberg HK, Matzke MM, Brown JN, Wang J, McDermott JE, Smith RD, Rodland KD, Metz TO, Pounds JG, Waters KM. Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics. J Proteome Res 2015; 14:1993-2001. [PMID: 25855118 DOI: 10.1021/pr501138h] [Citation(s) in RCA: 167] [Impact Index Per Article: 16.7] [Indexed: 12/14/2022]
Abstract
In this review, we apply selected imputation strategies to label-free liquid chromatography-mass spectrometry (LC-MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC-MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. On the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
Affiliation(s)
- Holli K Wiberg
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Melissa M Matzke
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Joseph N Brown
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Jing Wang
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Jason E McDermott
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Richard D Smith
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Karin D Rodland
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Thomas O Metz
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Joel G Pounds
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Katrina M Waters
- Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
|
13
|
Taylor SL, Leiserowitz GS, Kim K. Accounting for undetected compounds in statistical analyses of mass spectrometry 'omic studies. Stat Appl Genet Mol Biol 2014; 12:703-22. [PMID: 24246290 DOI: 10.1515/sagmb-2013-0021] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/15/2022]
Abstract
Mass spectrometry is an important high-throughput technique for profiling small molecular compounds in biological samples and is widely used to identify potential diagnostic and prognostic compounds associated with disease. Commonly, data generated by mass spectrometry have many missing values, which arise when a compound is absent from a sample or is present but at a concentration below the detection limit. Several strategies are available for statistically analyzing data with missing values. The accelerated failure time (AFT) model assumes all missing values result from censoring below a detection limit. Under a mixture model, missing values can result from a combination of censoring and the absence of a compound. We compare power and estimation of a mixture model to an AFT model. Based on simulated data, we found the AFT model to have greater power to detect differences in means and point mass proportions between groups. However, the AFT model yielded biased estimates, with the bias increasing as the proportion of observations in the point mass increased, while estimates were unbiased with the mixture model except if all missing observations came from censoring. These findings suggest using the AFT model for hypothesis testing and the mixture model for estimation. We demonstrated this approach through application to glycomics data of serum samples from women with ovarian cancer and matched controls.
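The contrast between the two models in this abstract can be made concrete with their log-likelihoods. The sketch below is an illustration under stated assumptions and not the authors' implementation: it assumes log-intensities are normally distributed with mean mu and standard deviation sigma, a single known detection limit, and a point-mass proportion tau for compounds that are truly absent. The function names are hypothetical. Setting tau to zero recovers the censoring-only (AFT-style) likelihood.

```python
import math


def mixture_loglik(values, detection_limit, mu, sigma, tau):
    """Log-likelihood of a point-mass plus left-censored normal mixture.

    values: log-intensities, with None marking a missing value.
    tau: assumed proportion of samples in which the compound is truly absent.
    A missing value is either a true absence (probability tau) or a present
    compound that fell below the detection limit (censoring).
    """
    if sigma <= 0.0 or not 0.0 <= tau < 1.0:
        raise ValueError("require sigma > 0 and 0 <= tau < 1")

    # P(value < detection_limit) under the normal model; erfc keeps the
    # lower tail accurate where 1 + erf(...) would round to zero.
    p_censored = 0.5 * math.erfc((mu - detection_limit) / (sigma * math.sqrt(2.0)))

    loglik = 0.0
    for v in values:
        if v is None:
            loglik += math.log(tau + (1.0 - tau) * p_censored)
        else:
            log_density = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                           - (v - mu) ** 2 / (2.0 * sigma ** 2))
            loglik += math.log(1.0 - tau) + log_density
    return loglik


def aft_loglik(values, detection_limit, mu, sigma):
    """Censoring-only likelihood: every missing value is below the limit."""
    return mixture_loglik(values, detection_limit, mu, sigma, tau=0.0)
```

When many values are missing but the detected values sit well above the detection limit, the censoring-only likelihood can only explain the gaps by shifting mu or inflating sigma, which is one way to read the bias the abstract reports for the AFT model as the point-mass proportion grows.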
|
14
|
Clough T, Thaminy S, Ragg S, Aebersold R, Vitek O. Statistical protein quantification and significance analysis in label-free LC-MS experiments with complex designs. BMC Bioinformatics 2012; 13 Suppl 16:S6. [PMID: 23176351 PMCID: PMC3489535 DOI: 10.1186/1471-2105-13-s16-s6] [Citation(s) in RCA: 100] [Impact Index Per Article: 7.7] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND: Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is widely used for quantitative proteomic investigations. The typical output of such studies is a list of identified and quantified peptides. The biological and clinical interest is, however, usually focused on quantitative conclusions at the protein level. Furthermore, many investigations ask complex biological questions by studying multiple interrelated experimental conditions. Therefore, there is a need in the field for generic statistical models to quantify protein levels even in complex study designs. RESULTS: We propose a general statistical modeling approach for protein quantification in arbitrarily complex experimental designs, such as time course studies, or those involving multiple experimental factors. The approach summarizes the quantitative experimental information from all the features and all the conditions that pertain to a protein. It enables both protein significance analysis between conditions, and protein quantification in individual samples or conditions. We implement the approach in an open-source R-based software package, MSstats, suitable for researchers with a limited statistics and programming background. CONCLUSIONS: We demonstrate, using as examples two experimental investigations with complex designs, that a simultaneous statistical modeling of all the relevant features and conditions yields a higher sensitivity of protein significance analysis and a higher accuracy of protein quantification as compared to commonly employed alternatives. The software is available at http://www.stat.purdue.edu/~ovitek/Software.html.
Affiliation(s)
- Timothy Clough
- Department of Statistics, Purdue University, West Lafayette, IN, USA.
|