1
|
Imputation of Missing Values for Multi-Biospecimen Metabolomics Studies: Bias and Effects on Statistical Validity. Metabolites 2022; 12:metabo12070671. [PMID: 35888795 PMCID: PMC9317643 DOI: 10.3390/metabo12070671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2022] [Revised: 07/07/2022] [Accepted: 07/19/2022] [Indexed: 02/05/2023]
Abstract
The analysis of high-throughput metabolomics mass spectrometry data across multiple biological sample types (biospecimens) poses challenges due to missing data. During differential abundance analysis, dropping samples with missing values can lead to severe loss of data as well as biased results in group comparisons and effect size estimates. However, the imputation of missing data (the process of replacing missing data with estimated values such as a mean) may compromise the inherent intra-subject correlation of a metabolite across multiple biospecimens from the same subject, which in turn may compromise the efficacy of the statistical analysis of differential metabolites in biomarker discovery. We investigated imputation strategies when considering multiple biospecimens from the same subject. We compared a novel but simple approach, which combines the two biospecimen data matrices (rows and columns of subjects and metabolites) and imputes them together, with an approach that imputes each biospecimen data matrix separately. We then compared the two approaches in terms of the bias in the estimation of the intra-subject multi-specimen correlation and its effects on the validity of statistical significance tests. The combined approach to multi-biospecimen studies has not been evaluated previously even though it is intuitive and easy to implement. We examined these two approaches for five imputation methods: random forest, k-nearest neighbor, expectation-maximization with bootstrap, quantile regression, and half the minimum observed value. Combining the biospecimen data matrices for imputation did not greatly increase efficacy in conserving the correlation structure or improving accuracy in the statistical conclusions for most of the methods examined. Random forest tended to outperform the other methods in all performance metrics, except specificity.
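The concern raised in this abstract, that imputation can distort the correlation of a metabolite across biospecimens from the same subject, can be illustrated with a minimal sketch. It is not the authors' simulation: the abundances, the missingness pattern, and the function names `half_min_impute` and `pearson` are hypothetical, and only the half-minimum method named in the abstract is shown.

```python
import math


def half_min_impute(values):
    """Replace missing entries (None) with half the smallest observed value."""
    observed = [v for v in values if v is not None]
    fill = min(observed) / 2
    return [fill if v is None else v for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical abundances of one metabolite measured in two biospecimens
# from the same six subjects (values invented for illustration).
specimen_1 = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
specimen_2 = [5.0, 6.5, 6.8, 8.2, 9.1, 9.9]

# The same second specimen with two values treated as missing.
specimen_2_missing = [None, 6.5, 6.8, None, 9.1, 9.9]

r_complete = pearson(specimen_1, specimen_2)
r_imputed = pearson(specimen_1, half_min_impute(specimen_2_missing))

print(f"correlation, complete data:      {r_complete:.3f}")
print(f"correlation, half-min imputed:   {r_imputed:.3f}")
```

In this toy data the correlation after half-minimum imputation is lower than the correlation in the complete data, which is the kind of bias the study quantifies across methods.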
|
2
|
Taylor S, Ponzini M, Wilson M, Kim K. Comparison of imputation and imputation-free methods for statistical analysis of mass spectrometry data with missing data. Brief Bioinform 2021; 23:6361033. [PMID: 34472591 DOI: 10.1093/bib/bbab353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/05/2021] [Revised: 07/27/2021] [Accepted: 08/10/2021] [Indexed: 11/14/2022]
Abstract
Missing values are common in high-throughput mass spectrometry data. Two strategies are available to address missing values: (i) eliminate or impute the missing values and apply statistical methods that require complete data and (ii) use statistical methods that specifically account for missing values without imputation (imputation-free methods). This study reviews the effect of sample size and percentage of missing values on statistical inference for multiple methods under these two strategies. With increasing missingness, the ability of imputation and imputation-free methods to identify differentially and non-differentially regulated compounds in a two-group comparison study declined. Random forest and k-nearest neighbor imputation combined with a Wilcoxon test performed well in statistical testing for up to 50% missingness with little bias in estimating the effect size. Quantile regression imputation accompanied by a Wilcoxon test also had good statistical testing outcomes but substantially distorted the difference in means between groups. None of the imputation-free methods performed consistently better for statistical testing than imputation methods.
Affiliation(s)
- Sandra Taylor
- Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
- Matthew Ponzini
- Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
- Machelle Wilson
- Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
- Kyoungmi Kim
- Division of Biostatistics, School of Medicine at the University of California, Davis, 2921 Stockton Boulevard, Suite 1400, Sacramento, CA 95817, USA
|
3
|
Wijasa TS, Sylvester M, Brocke-Ahmadinejad N, Schwartz S, Santarelli F, Gieselmann V, Klockgether T, Brosseron F, Heneka MT. Quantitative proteomics of synaptosome S-nitrosylation in Alzheimer's disease. J Neurochem 2019; 152:710-726. [PMID: 31520481 DOI: 10.1111/jnc.14870] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 04/25/2019] [Revised: 08/23/2019] [Accepted: 09/04/2019] [Indexed: 12/20/2022]
Abstract
Increasing evidence suggests that both synaptic loss and neuroinflammation constitute early pathologic hallmarks of Alzheimer's disease. A downstream event during inflammatory activation of microglia and astrocytes is the induction of nitric oxide synthase type 2, resulting in an increased release of nitric oxide and the post-translational S-nitrosylation of protein cysteine residues. Both early events, inflammation and synaptic dysfunction, could be connected if this excess nitrosylation occurs on synaptic proteins. In the long term, such changes could provide new insight into patho-mechanisms as well as biomarker candidates from the early stages of disease progression. This study investigated S-nitrosylation in synaptosomal proteins isolated from APP/PS1 model mice in comparison to wild type and NOS2-/- mice, as well as human control, mild cognitive impairment and Alzheimer's disease brain tissues. Proteomics data were obtained using an established protocol utilizing an isobaric mass tag method, followed by nanocapillary high performance liquid chromatography tandem mass spectrometry. Statistical analysis identified the S-nitrosylation sites most likely derived from an increase in nitric oxide (NO), depending on the presence of AD pathology, age and the key enzyme NOS2. The resulting list of candidate proteins is discussed considering function, previous findings in the context of neurodegeneration, and the potential for further validation studies.
Affiliation(s)
- Marc Sylvester
- Institute of Biochemistry and Molecular Biology, University of Bonn, Bonn, Germany
- Stephanie Schwartz
- Department of Neurodegenerative Diseases and Geriatric Psychiatry, University Hospital Bonn, Bonn, Germany
- Volkmar Gieselmann
- Institute of Biochemistry and Molecular Biology, University of Bonn, Bonn, Germany
- Thomas Klockgether
- German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany; Department of Neurology, University of Bonn, Bonn, Germany
- Michael T Heneka
- German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany; Department of Neurodegenerative Diseases and Geriatric Psychiatry, University Hospital Bonn, Bonn, Germany
|
4
|
Ebner JN, Ritz D, von Fumetti S. Comparative proteomics of stenotopic caddisfly Crunoecia irrorata identifies acclimation strategies to warming. Mol Ecol 2019; 28:4453-4469. [PMID: 31478292 PMCID: PMC6856850 DOI: 10.1111/mec.15225] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 06/04/2019] [Revised: 07/28/2019] [Accepted: 07/29/2019] [Indexed: 12/23/2022]
Abstract
Species' ecological preferences are often deduced from habitat characteristics thought to represent more or less optimal conditions for physiological functioning. Evolution has led to stenotopic and eurytopic species, the former having decreased niche breadths and lower tolerances to environmental variability. Species inhabiting freshwater springs are often described as being stenotopic specialists, adapted to the stable thermal conditions found in these habitats. Whether due to past local adaptation these species have evolved or have lost intra-generational adaptive mechanisms to cope with increasing thermal variability has, to our knowledge, never been investigated. By studying how the proteome of a stenotopic species changes as a result of increasing temperatures, we investigate if the absence or attenuation of molecular mechanisms is indicative of local adaptation to freshwater springs. An understanding of compensatory mechanisms is especially relevant as spring specialists will experience thermal conditions beyond their physiological limits due to climate change. In this study, the stenotopic species Crunoecia irrorata (Trichoptera: Lepidostomatidae, Curtis 1834) was acclimated to 10, 15 and 20°C for 168 hr. We constructed a homology-based database and via liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based shotgun proteomics identified 1,358 proteins. Differentially abundant proteins and protein norms of reaction revealed candidate proteins and molecular mechanisms facilitating compensatory responses such as trehalose metabolism, tracheal system alteration and heat-shock protein regulation. A species-specific understanding of compensatory physiologies challenges the characterization of species as having narrow tolerances to environmental variability if that characterization is based on occurrences and habitat characteristics alone.
Affiliation(s)
- Joshua N. Ebner
- Geoecology Research Group, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Danilo Ritz
- Proteomics Core Facility, Biozentrum, University of Basel, Basel, Switzerland
- Stefanie von Fumetti
- Geoecology Research Group, Department of Environmental Sciences, University of Basel, Basel, Switzerland
|
5
|
Liang Y, Kelemen A, Kelemen A. Reproducibility of biomarker identifications from mass spectrometry proteomic data in cancer studies. Stat Appl Genet Mol Biol 2019; 18:sagmb-2018-0039. [PMID: 31077580 DOI: 10.1515/sagmb-2018-0039] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Reproducibility of disease signatures and clinical biomarkers in multi-omics disease analysis has been a key challenge due to a multitude of factors. The heterogeneity of the limited sample, various biological factors such as environmental confounders, and the inherent experimental and technical noise, compounded with the inadequacy of statistical tools, can lead to the misinterpretation of results, and subsequently very different biology. In this paper, we investigate biomarker reproducibility issues potentially caused by differences between statistical methods with varied distribution assumptions or marker selection criteria, using mass spectrometry proteomic ovarian tumor data. We examine the relationship between effect sizes, p values, Cauchy p values, False Discovery Rate p values, and the rank fractions of identified proteins out of thousands in the limited heterogeneous sample. We compared the markers identified by single-feature statistical selection approaches with those identified by machine learning wrapper methods. The results reveal marked differences when selecting the protein markers from varied methods with potential selection biases and false discoveries, which may be due to the small effects, different distribution assumptions, and p value type criteria versus prediction accuracies. Alternative solutions and other related issues are discussed in support of the reproducibility of findings for clinically actionable outcomes.
Affiliation(s)
- Yulan Liang
- Department of Family and Community Health, University of Maryland, Baltimore, MD 21201-1579, USA
- Adam Kelemen
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Arpad Kelemen
- Department of Organizational Systems and Adult Health, University of Maryland, Baltimore, MD 21201-1579, USA
|
6
|
Pascovici D, Wu JX, McKay MJ, Joseph C, Noor Z, Kamath K, Wu Y, Ranganathan S, Gupta V, Mirzaei M. Clinically Relevant Post-Translational Modification Analyses-Maturing Workflows and Bioinformatics Tools. Int J Mol Sci 2018; 20:E16. [PMID: 30577541 PMCID: PMC6337699 DOI: 10.3390/ijms20010016] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 11/15/2018] [Revised: 12/09/2018] [Accepted: 12/17/2018] [Indexed: 01/04/2023]
Abstract
Post-translational modifications (PTMs) can occur soon after translation or at any stage in the lifecycle of a given protein, and they may help regulate protein folding, stability, cellular localisation, activity, or the interactions proteins have with other proteins or biomolecular species. PTMs are crucial to our functional understanding of biology, and new quantitative mass spectrometry (MS) and bioinformatics workflows are maturing both in labelled multiplexed and label-free techniques, offering increasing coverage and new opportunities to study human health and disease. Techniques such as Data Independent Acquisition (DIA) are emerging as promising approaches due to their re-mining capability. Many bioinformatics tools have been developed to support the analysis of PTMs by mass spectrometry, from prediction and identification of PTM site assignments, to open searches enabling better mining of unassigned mass spectra (many of which likely harbour PTMs), through to understanding PTM associations and interactions. The remaining challenge lies in extracting functional information from clinically relevant PTM studies. This review focuses on canvassing the options and progress of PTM analysis for large quantitative studies, from choosing the platform through to data analysis, with an emphasis on clinically relevant samples such as plasma and other body fluids, and well-established tools and options for data interpretation.
Affiliation(s)
- Dana Pascovici
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia.
- Jemma X Wu
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia.
- Matthew J McKay
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia.
- Chitra Joseph
- Department of Clinical Medicine, Macquarie University, Sydney, NSW 2109, Australia.
- Zainab Noor
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Karthik Kamath
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia.
- Yunqi Wu
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia.
- Shoba Ranganathan
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Vivek Gupta
- Department of Clinical Medicine, Macquarie University, Sydney, NSW 2109, Australia.
- Mehdi Mirzaei
- Department of Molecular Sciences, Macquarie University, Sydney, NSW 2109, Australia.
- Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia.
- Department of Clinical Medicine, Macquarie University, Sydney, NSW 2109, Australia.
|
7
|
Shu L, Arneson D, Yang X. Bioinformatics Principles for Deciphering Cardiovascular Diseases. ENCYCLOPEDIA OF CARDIOVASCULAR RESEARCH AND MEDICINE 2018:273-292. [DOI: 10.1016/b978-0-12-809657-4.99576-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/03/2025]
|
8
|
Statistical characterization of therapeutic protein modifications. Sci Rep 2017; 7:7896. [PMID: 28801661 PMCID: PMC5554216 DOI: 10.1038/s41598-017-08333-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 03/02/2017] [Accepted: 07/07/2017] [Indexed: 12/25/2022]
Abstract
Peptide mapping with liquid chromatography–tandem mass spectrometry (LC-MS/MS) is an important analytical method for characterization of post-translational and chemical modifications in therapeutic proteins. Despite its importance, there is currently no consensus on the statistical analysis of the resulting data. In this manuscript, we distinguish three statistical goals for therapeutic protein characterization: (1) estimation of site occupancy of modifications in one condition, (2) detection of differential site occupancy between conditions, and (3) estimation of combined site occupancy across multiple modification sites. We propose an approach, which addresses these goals in terms of summarizing the quantitative information from the mass spectra, statistical modeling, and model-based analysis of LC-MS/MS data. We illustrate the approach using an LC-MS/MS experiment from an antibody-drug conjugate and its monoclonal antibody intermediate. The performance was compared to a ‘naïve’ data analysis approach, by using computer simulation, evaluation of differential site occupancy in positive and negative controls, and comparisons of estimated site occupancy with orthogonal experimental measurements of N-linked glycoforms and total oxidation. The results demonstrated the importance of replicated studies of protein characterization, and of appropriate statistical modeling, for reproducible, accurate and efficient site occupancy estimation and differential analysis.
|
9
|
Taylor SL, Ruhaak LR, Weiss RH, Kelly K, Kim K. Multivariate two-part statistics for analysis of correlated mass spectrometry data from multiple biological specimens. Bioinformatics 2016; 33:17-25. [PMID: 27592710 DOI: 10.1093/bioinformatics/btw578] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/19/2016] [Revised: 08/30/2016] [Accepted: 08/31/2016] [Indexed: 11/14/2022]
Abstract
MOTIVATION High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data and imputation can impact between-biospecimen correlation and multivariate analysis results. RESULTS We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens, but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. AVAILABILITY AND IMPLEMENTATION We provide R functions to implement and illustrate our method as supplementary information. CONTACT sltaylor@ucdavis.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
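The abstract does not spell out the proposed statistics, so the following is only a sketch of the general two-part idea for a single compound in a single biospecimen, not the authors' multivariate method or their R functions. It assumes a classical univariate form: a squared two-proportion z statistic for detected versus missing values is added to a squared Welch-type t statistic for the observed values, and significance comes from a label-permutation null. The names `two_part_statistic` and `permutation_p_value`, and the use of `None` for missing values, are hypothetical.

```python
import random


def two_part_statistic(group_a, group_b):
    """Squared z for detection rates plus squared Welch t for observed values."""
    obs_a = [v for v in group_a if v is not None]
    obs_b = [v for v in group_b if v is not None]
    n_a, n_b = len(group_a), len(group_b)

    # Part 1: compare the proportions of detected (non-missing) values.
    p_a, p_b = len(obs_a) / n_a, len(obs_b) / n_b
    p_pool = (len(obs_a) + len(obs_b)) / (n_a + n_b)
    var_p = p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)
    z_sq = (p_a - p_b) ** 2 / var_p if var_p > 0 else 0.0

    # Part 2: compare the means of the observed values only.
    t_sq = 0.0
    if len(obs_a) > 1 and len(obs_b) > 1:
        m_a, m_b = sum(obs_a) / len(obs_a), sum(obs_b) / len(obs_b)
        v_a = sum((v - m_a) ** 2 for v in obs_a) / (len(obs_a) - 1)
        v_b = sum((v - m_b) ** 2 for v in obs_b) / (len(obs_b) - 1)
        se_sq = v_a / len(obs_a) + v_b / len(obs_b)
        if se_sq > 0:
            t_sq = (m_a - m_b) ** 2 / se_sq

    return z_sq + t_sq


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """P-value from shuffling group labels; missing values move with samples."""
    rng = random.Random(seed)
    observed = two_part_statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = two_part_statistic(pooled[:len(group_a)], pooled[len(group_a):])
        if permuted >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical intensities for one compound in cases and controls.
cases = [None, None, 1.0, 1.2, None, 0.9]
controls = [5.0, 5.5, 6.0, 5.2, 5.8, 6.1]
print(permutation_p_value(cases, controls))
```

A multivariate version in the spirit of the abstract would compute such a statistic per biospecimen and permute subject labels jointly, so that the between-biospecimen correlation is preserved in the null distribution.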
Affiliation(s)
- Sandra L Taylor
- Division of Biostatistics, Department of Public Health Sciences, University of California Davis, CA, 95616, USA
- L Renee Ruhaak
- Department of Clinical Chemistry and Laboratory Medicine, Leiden University Medical Center, Leiden, The Netherlands
- Karen Kelly
- Division of Hematology and Oncology, Department of Internal Medicine, School of Medicine, University of California, Davis, CA 95616, USA
- Kyoungmi Kim
- Division of Biostatistics, Department of Public Health Sciences, University of California Davis, CA, 95616, USA
|
10
|
A Bayesian algorithm for detecting differentially expressed proteins and its application in breast cancer research. Sci Rep 2016; 6:30159. [PMID: 27444576 PMCID: PMC4957118 DOI: 10.1038/srep30159] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/28/2015] [Accepted: 06/28/2016] [Indexed: 02/07/2023]
Abstract
The presence of considerable noise and missing data points makes the analysis of mass spectrometry (MS)-based proteomic data a challenging task. The missing values in MS data are caused by the inability of MS machines to reliably detect proteins whose abundances fall below the detection limit. We developed a Bayesian algorithm that exploits this knowledge and uses missing data points as a complementary source of information to the observed protein intensities in order to find differentially expressed proteins by analysing MS-based proteomic data. We compared its accuracy with many other methods using several simulated datasets. It consistently outperformed other methods. We then used it to analyse proteomic screens of a breast cancer (BC) patient cohort. It revealed large differences between the proteomic landscapes of triple negative and Luminal A, which are the most and least aggressive types of BC. Unexpectedly, the majority of these differences could be attributed to the direct transcriptional activity of only seven transcription factors, some of which are known to be inactive in triple negative BC. We also identified two new proteins which significantly correlated with the survival of BC patients, and therefore may have potential diagnostic/prognostic value.
|
11
|
Tang S, Hemberg M, Cansizoglu E, Belin S, Kosik K, Kreiman G, Steen H, Steen J. f-divergence cutoff index to simultaneously identify differential expression in the integrated transcriptome and proteome. Nucleic Acids Res 2016; 44:e97. [PMID: 26980280 PMCID: PMC4889934 DOI: 10.1093/nar/gkw157] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/02/2015] [Accepted: 02/28/2016] [Indexed: 11/16/2022]
Abstract
The ability to integrate ‘omics’ (i.e. transcriptomics and proteomics) is becoming increasingly important to the understanding of regulatory mechanisms. There are currently no tools available to identify differentially expressed genes (DEGs) across different ‘omics’ data types or multi-dimensional data including time courses. We present fCI (f-divergence Cut-out Index), a model capable of simultaneously identifying DEGs from continuous and discrete transcriptomic, proteomic and integrated proteogenomic data. We show that fCI can be used across multiple diverse sets of data and can unambiguously find genes that show functional modulation, developmental changes or misregulation. Applying fCI to several proteogenomics datasets, we identified a number of important genes that showed distinctive regulation patterns. The package fCI is available at R Bioconductor and http://software.steenlab.org/fCI/.
Affiliation(s)
- Shaojun Tang
- Departments of Pathology, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Martin Hemberg
- Department of Ophthalmology, Boston Children's Hospital, Boston, MA 02115, USA
- Ertugrul Cansizoglu
- F.M. Kirby Neurobiology Center, Boston Children's Hospital, and Department of Neurology, Harvard Medical School, Boston, MA 02115, USA
- Stephane Belin
- F.M. Kirby Neurobiology Center, Boston Children's Hospital, and Department of Neurology, Harvard Medical School, Boston, MA 02115, USA
- Kenneth Kosik
- Neuroscience Research Institute, University of California at Santa Barbara, Santa Barbara, CA 93106, USA
- Gabriel Kreiman
- Department of Ophthalmology, Boston Children's Hospital, Boston, MA 02115, USA
- Hanno Steen
- Departments of Pathology, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Judith Steen
- F.M. Kirby Neurobiology Center, Boston Children's Hospital, and Department of Neurology, Harvard Medical School, Boston, MA 02115, USA
|
12
|
Computational and statistical methods for high-throughput analysis of post-translational modifications of proteins. J Proteomics 2015. [PMID: 26216596 DOI: 10.1016/j.jprot.2015.07.016] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 12/16/2022]
Abstract
The investigation of post-translational modifications (PTMs) represents one of the main research focuses for the study of protein function and cell signaling. Mass spectrometry instrumentation with increasing sensitivity, improved protocols for PTM enrichment, and recently established pipelines for high-throughput experiments allow large-scale identification and quantification of several PTM types. This review addresses the concurrently emerging challenges for the computational analysis of the resulting data and presents PTM-centered approaches for spectra identification, statistical analysis, multivariate analysis and data interpretation. We furthermore discuss the potential of future developments that will help to gain deep insight into the PTM-ome and its biological role in cells. This article is part of a Special Issue entitled: Computational Proteomics.
|
13
|
Gibb S, Strimmer K. Differential protein expression and peak selection in mass spectrometry data by binary discriminant analysis. Bioinformatics 2015; 31:3156-62. [PMID: 26026136 DOI: 10.1093/bioinformatics/btv334] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 02/27/2015] [Accepted: 05/26/2015] [Indexed: 02/05/2023]
Abstract
MOTIVATION Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remain challenging. RESULTS Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. This approach works by data-adaptive thresholding of protein expression values and subsequent ranking of the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach of Tibshirani et al. (2004) and can be applied both in the two-group and the multi-group setting. Our approach is computationally inexpensive and, in the analysis of a large-scale drug discovery test dataset, achieves prediction accuracy equivalent to that of a random forest. Furthermore, in the analysis of mass spectrometry data from a pancreas cancer study, we were able to identify biologically relevant and statistically predictive marker peaks unrecognized in the original study. AVAILABILITY AND IMPLEMENTATION The methodology for binary discriminant analysis is implemented in the R package binda, which is freely available under the GNU General Public License (version 3 or later) from CRAN at URL http://cran.r-project.org/web/packages/binda/. R scripts reproducing all described analyses are available from the web page http://strimmerlab.org/software/binda/. CONTACT k.strimmer@imperial.ac.uk.
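The two steps named in the abstract, thresholding followed by relative-entropy ranking, can be sketched for the two-group case as follows. This is a simplified illustration and not the binda implementation: the threshold rule (midpoint between the two group means), the add-one smoothing, the symmetric Kullback-Leibler score, and the function names are all assumptions made here for the example.

```python
import math


def dichotomize(values, threshold):
    """Map each intensity to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def smoothed_frequency(bits):
    """Fraction of ones with add-one smoothing, kept strictly inside (0, 1)."""
    return (sum(bits) + 1) / (len(bits) + 2)


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def score_peak(group_a, group_b):
    """Symmetric relative entropy of the dichotomized peak between two groups."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    threshold = (mean_a + mean_b) / 2  # assumed threshold rule
    p_a = smoothed_frequency(dichotomize(group_a, threshold))
    p_b = smoothed_frequency(dichotomize(group_b, threshold))
    return bernoulli_kl(p_a, p_b) + bernoulli_kl(p_b, p_a)


def rank_peaks(data_a, data_b):
    """Return peak names ordered from most to least discriminating."""
    scores = {name: score_peak(data_a[name], data_b[name]) for name in data_a}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical peak intensities for two groups of four samples each.
group_a = {"peak_1": [1.0, 1.5, 0.5, 1.0], "peak_2": [2.0, 3.0, 1.0, 2.0]}
group_b = {"peak_1": [3.0, 3.5, 2.5, 3.0], "peak_2": [2.0, 1.0, 3.0, 2.0]}
print(rank_peaks(group_a, group_b))
```

Here `peak_1` separates the groups cleanly after thresholding and ranks first, while `peak_2` has the same on/off frequency in both groups and scores zero.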
Affiliation(s)
- Sebastian Gibb
- Anesthesiology and Intensive Care Medicine, University Hospital Greifswald, Ferdinand-Sauerbruch-Straße, D-17475 Greifswald, Germany
- Korbinian Strimmer
- Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, W2 1PG, UK
|
14
|
Zhan X, Patterson AD, Ghosh D. Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data. BMC Bioinformatics 2015; 16:77. [PMID: 25887233 PMCID: PMC4359587 DOI: 10.1186/s12859-015-0506-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 09/29/2014] [Accepted: 02/20/2015] [Indexed: 01/12/2023]
Abstract
BACKGROUND Data generated from metabolomics experiments are different from other types of "-omics" data. For example, a common phenomenon in mass spectrometry (MS)-based metabolomics data is that the data matrix frequently contains missing values, which complicates some quantitative analyses. One way to tackle this problem is to treat the corresponding metabolites as absent. Hence there are two types of information available in metabolomics data: presence/absence of a metabolite and a quantitative value of the abundance level of a metabolite if it is present. Combining these two layers of information poses challenges to the application of traditional statistical approaches in differential expression analysis. RESULTS In this article, we propose a novel kernel-based score test for metabolomics differential expression analysis. In order to simultaneously capture both the continuous pattern and the discrete pattern in metabolomics data, two new kinds of kernels are designed. One is the distance-based kernel and the other is the stratified kernel. While we initially describe the procedures in the case of single-metabolite analysis, we extend the methods to handle metabolite sets as well. CONCLUSIONS Evaluation based on both simulated data and real data from a liver cancer metabolomics study indicates that our kernel method performs better than some existing alternatives. An implementation of the proposed kernel method in the R statistical computing environment is available at http://works.bepress.com/debashis_ghosh/60/.
Affiliation(s)
- Xiang Zhan
- Department of Statistics, Pennsylvania State University, 325 Thomas Building, University Park, 16802, PA, USA.
- Andrew D Patterson
- Department of Molecular Toxicology, Pennsylvania State University, 322 Life Sciences Bldg, University Park, 16802, PA, USA.
- Debashis Ghosh
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, 13001 East 17th Place, Aurora, 80045, CO, USA.
|
15
|
Jow H, Boys RJ, Wilkinson DJ. Bayesian identification of protein differential expression in multi-group isobaric labelled mass spectrometry data. Stat Appl Genet Mol Biol 2014; 13:531-51. [PMID: 25153608 DOI: 10.1515/sagmb-2012-0066] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
In this paper we develop a Bayesian statistical inference approach to the unified analysis of isobaric labelled MS/MS proteomic data across multiple experiments. An explicit probabilistic model of the log-intensity of the isobaric labels' reporter ions across multiple pre-defined groups and experiments is developed. This is then used to develop a full Bayesian statistical methodology for the identification of differentially expressed proteins, with respect to a control group, across multiple groups and experiments. This methodology is implemented and then evaluated on simulated data and on two model experimental datasets (for which the differentially expressed proteins are known) that use a TMT labelling protocol.
|
16
|
Taylor SL, Leiserowitz GS, Kim K. Accounting for undetected compounds in statistical analyses of mass spectrometry 'omic studies. Stat Appl Genet Mol Biol 2014; 12:703-22. [PMID: 24246290 DOI: 10.1515/sagmb-2013-0021] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/15/2022]
Abstract
Mass spectrometry is an important high-throughput technique for profiling small molecular compounds in biological samples and is widely used to identify potential diagnostic and prognostic compounds associated with disease. Commonly, this data generated by mass spectrometry has many missing values resulting when a compound is absent from a sample or is present but at a concentration below the detection limit. Several strategies are available for statistically analyzing data with missing values. The accelerated failure time (AFT) model assumes all missing values result from censoring below a detection limit. Under a mixture model, missing values can result from a combination of censoring and the absence of a compound. We compare power and estimation of a mixture model to an AFT model. Based on simulated data, we found the AFT model to have greater power to detect differences in means and point mass proportions between groups. However, the AFT model yielded biased estimates with the bias increasing as the proportion of observations in the point mass increased while estimates were unbiased with the mixture model except if all missing observations came from censoring. These findings suggest using the AFT model for hypothesis testing and mixture model for estimation. We demonstrated this approach through application to glycomics data of serum samples from women with ovarian cancer and matched controls.
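The contrast between the two models in this abstract can be made concrete with a small likelihood sketch. It is an illustration only, not the authors' implementation: it assumes log-abundances are normally distributed, and the function names (`mixture_loglik`, `censored_loglik`) and the parameter `p_absent` are hypothetical. Under the censoring-only view, every missing value contributes the probability of falling below the detection limit; under the mixture view, a missing value is either a true absence (point mass) or a censored observation.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (assumed model for log-abundance)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the same normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_loglik(observed, n_missing, detection_limit, mu, sigma, p_absent):
    """Log-likelihood when a missing value is either truly absent
    (probability p_absent) or present but below the detection limit."""
    p_missing = p_absent + (1.0 - p_absent) * normal_cdf(detection_limit, mu, sigma)
    loglik = n_missing * math.log(p_missing)
    for y in observed:
        # An observed value must come from the "present" component.
        loglik += math.log((1.0 - p_absent) * normal_pdf(y, mu, sigma))
    return loglik


def censored_loglik(observed, n_missing, detection_limit, mu, sigma):
    """Censoring-only likelihood: the mixture with no point mass at absence."""
    return mixture_loglik(observed, n_missing, detection_limit, mu, sigma, p_absent=0.0)
```

Setting `p_absent` to zero recovers the censoring-only likelihood, which is why the two models coincide when all missing observations come from censoring.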
|
17
|
Ryu SY, Qian WJ, Camp DG, Smith RD, Tompkins RG, Davis RW, Xiao W. Detecting differential protein expression in large-scale population proteomics. Bioinformatics 2014; 30:2741-6. [PMID: 24928210 DOI: 10.1093/bioinformatics/btu341] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 12/17/2022]
Abstract
MOTIVATION Mass spectrometry (MS)-based high-throughput quantitative proteomics shows great potential in large-scale clinical biomarker studies, identifying and quantifying thousands of proteins in biological samples. However, there are unique challenges in analyzing the quantitative proteomics data. One issue is that the quantification of a given peptide is often missing in a subset of the experiments, especially for less abundant peptides. Another issue is that different MS experiments of the same study have significantly varying numbers of peptides quantified, which can result in more missing peptide abundances in an experiment that has a smaller total number of quantified peptides. To detect as many biomarker proteins as possible, it is necessary to develop bioinformatics methods that appropriately handle these challenges. RESULTS We propose a Significance Analysis for Large-scale Proteomics Studies (SALPS) that handles missing peptide intensity values caused by the two mechanisms mentioned above. Our model has a robust performance in both simulated data and proteomics data from a large clinical study. Because varying patients' sample qualities and deviating instrument performances are not avoidable for clinical studies performed over the course of several years, we believe that our approach will be useful to analyze large-scale clinical proteomics data. AVAILABILITY AND IMPLEMENTATION R codes for SALPS are available at http://www.stanford.edu/%7eclairesr/software.html.
Affiliation(s)
- So Young Ryu
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Wei-Jun Qian
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- David G Camp
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Richard D Smith
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Ronald G Tompkins
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Ronald W Davis
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Wenzhong Xiao
- Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA, Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
|
18
|
Abstract
The conventional reductionist approach to cardiovascular research investigates individual candidate factors or linear signalling pathways but ignores more complex interactions in biological systems. The advent of molecular profiling technologies that focus on a global characterization of whole complements allows an exploration of the interconnectivity of pathways during pathophysiologically relevant processes, but has brought about the issue of statistical analysis and data integration. Proteins identified by differential expression as well as those in protein–protein interaction networks identified through experiments and through computational modelling techniques can be used as an initial starting point for functional analyses. In combination with other ‘-omics’ technologies, such as transcriptomics and metabolomics, proteomics explores different aspects of disease, and the different pillars of observations facilitate the data integration in disease-specific networks. Ultimately, a systems biology approach may advance our understanding of cardiovascular disease processes at a ‘biological pathway’ instead of a ‘single molecule’ level and accelerate progress towards disease-modifying interventions.
Affiliation(s)
- Sarah R Langley
- King's British Heart Foundation Centre, King's College London, 125 Coldharbour Lane, London SE5 9NU, UK
|
19
|
Label-free quantitative proteomics trends for protein-protein interactions. J Proteomics 2012; 81:91-101. [PMID: 23153790 DOI: 10.1016/j.jprot.2012.10.027] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.2] [Received: 07/09/2012] [Revised: 10/24/2012] [Accepted: 10/31/2012] [Indexed: 12/14/2022]
Abstract
Understanding protein interactions within the complexity of a living cell is challenging, but techniques coupling affinity purification and mass spectrometry have enabled important progress to be made in the past 15 years. As identification of protein-protein interactions is becoming easier, the quantification of the interaction dynamics is the next frontier. Several quantitative mass spectrometric approaches have been developed to address this issue that vary in their strengths and weaknesses. While isotopic labeling approaches continue to contribute to the identification of regulated interactions, techniques that do not require labeling are becoming increasingly used in the field. Here, we describe the major types of label-free quantification used in interaction proteomics, and discuss the relative merits of data dependent and data independent acquisition approaches in label-free quantification. This article is part of a Special Issue entitled: From protein structures to clinical applications.
|
20
|
Matzke MM, Brown JN, Gritsenko MA, Metz TO, Pounds JG, Rodland KD, Shukla AK, Smith RD, Waters KM, McDermott JE, Webb-Robertson BJ. A comparative analysis of computational approaches to relative protein quantification using peptide peak intensities in label-free LC-MS proteomics experiments. Proteomics 2012; 13:493-503. [PMID: 23019139 DOI: 10.1002/pmic.201200269] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.9] [Received: 06/29/2012] [Revised: 08/14/2012] [Accepted: 08/22/2012] [Indexed: 12/24/2022]
Abstract
Liquid chromatography coupled with mass spectrometry (LC-MS) is widely used to identify and quantify peptides in complex biological samples. In particular, label-free shotgun proteomics is highly effective for the identification of peptides and subsequently obtaining a global protein profile of a sample. As a result, this approach is widely used for discovery studies. Typically, the objective of these discovery studies is to identify proteins that are affected by some condition of interest (e.g. disease, exposure). However, for complex biological samples, label-free LC-MS proteomics experiments measure peptides and do not directly yield protein quantities. Thus, protein quantification must be inferred from one or more measured peptides. In recent years, many computational approaches to relative protein quantification of label-free LC-MS data have been published. In this review, we examine the most commonly employed quantification approaches to relative protein abundance from peak intensity values, evaluate their individual merits, and discuss challenges in the use of the various computational approaches.
|
21
|
Clough T, Thaminy S, Ragg S, Aebersold R, Vitek O. Statistical protein quantification and significance analysis in label-free LC-MS experiments with complex designs. BMC Bioinformatics 2012; 13 Suppl 16:S6. [PMID: 23176351 PMCID: PMC3489535 DOI: 10.1186/1471-2105-13-s16-s6] [Citation(s) in RCA: 100] [Impact Index Per Article: 7.7] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is widely used for quantitative proteomic investigations. The typical output of such studies is a list of identified and quantified peptides. The biological and clinical interest is, however, usually focused on quantitative conclusions at the protein level. Furthermore, many investigations ask complex biological questions by studying multiple interrelated experimental conditions. Therefore, there is a need in the field for generic statistical models to quantify protein levels even in complex study designs. RESULTS We propose a general statistical modeling approach for protein quantification in arbitrary complex experimental designs, such as time course studies, or those involving multiple experimental factors. The approach summarizes the quantitative experimental information from all the features and all the conditions that pertain to a protein. It enables both protein significance analysis between conditions, and protein quantification in individual samples or conditions. We implement the approach in an open-source R-based software package MSstats suitable for researchers with a limited statistics and programming background. CONCLUSIONS We demonstrate, using as examples two experimental investigations with complex designs, that a simultaneous statistical modeling of all the relevant features and conditions yields a higher sensitivity of protein significance analysis and a higher accuracy of protein quantification as compared to commonly employed alternatives. The software is available at http://www.stat.purdue.edu/~ovitek/Software.html.
Affiliation(s)
- Timothy Clough
- Department of Statistics, Purdue University, West Lafayette, IN, USA.
|