1
Kong W, Hui HWH, Peng H, Goh WWB. Dealing with missing values in proteomics data. Proteomics 2022; 22:e2200092. [PMID: 36349819 DOI: 10.1002/pmic.202200092] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 07/14/2022] [Revised: 09/15/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022]
Abstract
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reducing statistical power, introducing bias, and failing to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data are normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide the selection of a suitable MVI method, we provide a decision chart that facilitates strategic considerations on datasets presenting different characteristics. We also bring attention to other issues that can affect MVI performance, such as the presence of confounders (e.g., batch effects). These, too, should be considered during or before MVI.
Affiliation(s)
- Weijia Kong
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Harvard Wai Hann Hui
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Hui Peng
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore; Centre for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
2
Ferreira M, Ventorim R, Almeida E, Silveira S, Silveira W. Protein Abundance Prediction Through Machine Learning Methods. J Mol Biol 2021; 433:167267. [PMID: 34563548 DOI: 10.1016/j.jmb.2021.167267] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/16/2021] [Revised: 09/09/2021] [Accepted: 09/17/2021] [Indexed: 10/20/2022]
Abstract
Proteins are responsible for most physiological processes, and their abundance provides crucial information for systems biology research. However, absolute protein quantification, as determined by mass spectrometry, still has limitations in capturing the protein pool. Protein abundance is impacted by translation kinetics, which rely on features of codons. In this study, we evaluated the effect of codon usage bias of genes on protein abundance. Notably, we observed differences regarding codon usage patterns between genes coding for highly abundant proteins and genes coding for less abundant proteins. Analysis of synonymous codon usage and evolutionary selection showed a clear split between the two groups. Our machine learning models predicted protein abundances from codon usage metrics with remarkable accuracy, achieving strong correlation with experimental data. Upon integration of the predicted protein abundance in enzyme-constrained genome-scale metabolic models, the simulated phenotypes closely matched experimental data, which demonstrates that our predictive models are valuable tools for systems metabolic engineering approaches.
Affiliation(s)
- Mauricio Ferreira
  - Department of Microbiology, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil. https://twitter.com/@mauriciomyces
- Rafaela Ventorim
  - Department of Microbiology, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil
- Eduardo Almeida
  - Department of Microbiology, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil. https://twitter.com/@elm_almeida
- Sabrina Silveira
  - Department of Computer Science, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil. https://twitter.com/@sabrina_as
- Wendel Silveira
  - Department of Microbiology, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil
3
Zhu X, Wang J, Sun B, Ren C, Yang T, Ding J. An efficient ensemble method for missing value imputation in microarray gene expression data. BMC Bioinformatics 2021; 22:188. [PMID: 33849444 PMCID: PMC8045198 DOI: 10.1186/s12859-021-04109-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/30/2020] [Accepted: 03/29/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genomics data analysis has been widely used to study disease genes and drug targets. However, the existence of missing values in genomics datasets poses a significant problem, which severely hinders the use of genomics data. Current imputation methods based on a single learner often explore less of the known genomic data information for imputation and thus lose imputation performance. RESULTS In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied for predictions of missing values by each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and the expression for the optimal weights is derived in closed form. Additionally, the performance of the ensemble method is analytically investigated in terms of the sum of squared regression errors. The proposed method is simulated on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation performance in terms of imputation accuracy, robustness and generalization. CONCLUSION The ensemble method possesses superior imputation performance since it can make use of known data information more efficiently for missing data imputation by integrating diverse imputation methods and learning the integration weights in a data-driven way.
Affiliation(s)
- Xinshan Zhu
  - School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China; State Key Laboratory of Digital Publishing Technology, Beijing, 100871, China
- Jiayu Wang
  - School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
- Biao Sun
  - School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
- Chao Ren
  - School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
- Ting Yang
  - School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
- Jie Ding
  - China Institute of FTZ Supply Chain, Shanghai Maritime University, Shanghai, 201306, China
4
Bramer LM, Irvahn J, Piehowski PD, Rodland KD, Webb-Robertson BJM. A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun Proteomics. J Proteome Res 2020; 20:1-13. [PMID: 32929967 DOI: 10.1021/acs.jproteome.0c00123] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 12/20/2022]
Abstract
The throughput efficiency and increased depth of coverage provided by isobaric-labeled proteomics measurements have led to increased usage of these techniques. However, the structure of missing data differs from that in unlabeled studies, which prompts this review comparing the efficacy of nine imputation methods on large isobaric-labeled proteomics data sets to guide researchers on the appropriateness of various imputation methods. Imputation methods were evaluated by accuracy, statistical hypothesis test inference, and run time. In general, expectation maximization and random forest imputation methods yielded the best performance, and constant-based methods consistently performed poorly across all data set sizes and percentages of missing values. For data sets with small sample sizes and higher percentages of missing data, results indicate that statistical inference with no imputation may be preferable. On the basis of the findings in this review, there are core imputation methods that perform better for isobaric-labeled proteomics data, but great care and consideration as to whether imputation is the optimal strategy should be given for data sets comprised of a small number of samples.
Affiliation(s)
- Lisa M Bramer
  - Computing & Analytics Division, Pacific Northwest National Laboratory, Richland, Washington 99354, United States
- Jan Irvahn
  - Boeing, Seattle, Washington 98055, United States
- Paul D Piehowski
  - Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, Washington 99354, United States
- Karin D Rodland
  - Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, Washington 99354, United States
- Bobbie-Jo M Webb-Robertson
  - Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, Washington 99354, United States
5
Giudice G, Petsalaki E. Proteomics and phosphoproteomics in precision medicine: applications and challenges. Brief Bioinform 2019; 20:767-777. [PMID: 29077858 PMCID: PMC6585152 DOI: 10.1093/bib/bbx141] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 07/01/2017] [Revised: 09/21/2017] [Indexed: 12/11/2022] Open
Abstract
Recent advances in proteomics allow the accurate measurement of abundances for thousands of proteins and phosphoproteins from multiple samples in parallel. Therefore, for the first time, we have the opportunity to measure the proteomic profiles of thousands of patient samples or disease model cell lines in a systematic way, to identify the precise underlying molecular mechanism and discover personalized biomarkers, networks and treatments. Here, we review examples of successful use of proteomics and phosphoproteomics data sets, as well as their integration with other omics data sets, with the aim of precision medicine. We will discuss the bioinformatics challenges posed by the generation, analysis and integration of such large data sets and present potential reasons why proteomics profiling and biomarkers are not currently widely used in the clinical setting. We will finally discuss ways to contribute to the better use of proteomics data in precision medicine and the clinical setting.
Affiliation(s)
- Girolamo Giudice
  - European Molecular Biology Laboratory, European Bioinformatics Institute
6
Kumar D, Bansal G, Narang A, Basak T, Abbas T, Dash D. Integrating transcriptome and proteome profiling: Strategies and applications. Proteomics 2016; 16:2533-2544. [PMID: 27343053 DOI: 10.1002/pmic.201600140] [Citation(s) in RCA: 108] [Impact Index Per Article: 13.5] [Received: 03/16/2016] [Revised: 06/12/2016] [Accepted: 06/23/2016] [Indexed: 12/17/2022]
Abstract
Discovering the gene expression signature associated with a cellular state is one of the basic quests in the majority of biological studies. For most clinical and cellular manifestations, these molecular differences may be exhibited across multiple layers of gene regulation, such as genomic variations, gene expression, protein translation and post-translational modifications. These system-wide variations are dynamic in nature and their crosstalk is overwhelmingly complex; thus, analyzing them separately may not be very informative. This necessitates the integrative analysis of such multiple layers of information to understand the interplay of the individual components of the biological system. Recent developments in high-throughput RNA sequencing and mass spectrometric (MS) technologies to probe transcripts and proteins have made these the preferred methods for understanding global gene regulation. Subsequently, improvements in "big-data" analysis techniques enable novel conclusions to be drawn from integrative transcriptomic-proteomic analysis. The unified analyses of both these data types have been rewarding for several biological objectives, such as improving genome annotation, predicting RNA-protein quantities, deciphering gene regulations, and discovering disease markers and drug targets. There are different ways in which transcriptomics and proteomics data can be integrated, each aiming at different research objectives. Here, we review various studies, approaches and computational tools targeted at integrative analysis of these two high-throughput omics methods.
Affiliation(s)
- Dhirendra Kumar
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, INDIA
- Gourja Bansal
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, INDIA
- Ankita Narang
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, INDIA
- Trayambak Basak
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, INDIA; Academy of Scientific & Innovative Research (AcSIR), CSIR-IGIB South Campus, New Delhi, India
- Tahseen Abbas
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, INDIA; Academy of Scientific & Innovative Research (AcSIR), CSIR-IGIB South Campus, New Delhi, India
- Debasis Dash
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, INDIA; Academy of Scientific & Innovative Research (AcSIR), CSIR-IGIB South Campus, New Delhi, India
7
Lin D, Zhang J, Li J, Xu C, Deng HW, Wang YP. An integrative imputation method based on multi-omics datasets. BMC Bioinformatics 2016; 17:247. [PMID: 27329642 PMCID: PMC4915152 DOI: 10.1186/s12859-016-1122-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 09/30/2015] [Accepted: 06/05/2016] [Indexed: 12/26/2022] Open
Abstract
Background Integrative analysis of multi-omics data is becoming increasingly important to unravel functional mechanisms of complex diseases. However, the currently available multi-omics datasets inevitably suffer from missing values due to technical limitations and various constraints in experiments. These missing values severely hinder integrative analysis of multi-omics data. Current imputation methods mainly focus on using single omics data while ignoring biological interconnections and information embedded in multi-omics data sets. Results In this study, a novel multi-omics imputation method was proposed to integrate multiple correlated omics datasets for improving the imputation accuracy. Our method was designed to: 1) combine the estimates of missing values from the individual omics data itself as well as from other omics, and 2) simultaneously impute multiple missing omics datasets by an iterative algorithm. We compared our method with five imputation methods using single omics data at different noise levels, sample sizes and data missing rates. The results consistently demonstrated the advantage and efficiency of our method in terms of the imputation error and the recovery of mRNA-miRNA network structure. Conclusions We concluded that our proposed imputation method can utilize more biological information to minimize the imputation error and thus can improve the performance of downstream analysis such as genetic regulatory network construction.
Affiliation(s)
- Dongdong Lin
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA, 70118, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
- Jigang Zhang
  - Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
- Jingyao Li
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA, 70118, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
- Chao Xu
  - Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
- Hong-Wen Deng
  - Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
- Yu-Ping Wang
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA, 70118, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
8
A Post-Genomic View of the Ecophysiology, Catabolism and Biotechnological Relevance of Sulphate-Reducing Prokaryotes. Adv Microb Physiol 2015. [PMID: 26210106 DOI: 10.1016/bs.ampbs.2015.05.002] [Citation(s) in RCA: 174] [Impact Index Per Article: 19.3] [Indexed: 12/29/2022]
Abstract
Dissimilatory sulphate reduction is the unifying and defining trait of sulphate-reducing prokaryotes (SRP). In their predominant habitats, sulphate-rich marine sediments, SRP have long been recognized to be major players in the carbon and sulphur cycles. Other, more recently appreciated, ecophysiological roles include activity in the deep biosphere, symbiotic relations, syntrophic associations, human microbiome/health and long-distance electron transfer. SRP include a high diversity of organisms, with large nutritional versatility and broad metabolic capacities, including anaerobic degradation of aromatic compounds and hydrocarbons. Elucidation of novel catabolic capacities as well as progress in the understanding of metabolic and regulatory networks, energy metabolism, evolutionary processes and adaptation to changing environmental conditions has greatly benefited from genomics, functional OMICS approaches and advances in genetic accessibility and biochemical studies. Important biotechnological roles of SRP range from (i) wastewater and off gas treatment, (ii) bioremediation of metals and hydrocarbons and (iii) bioelectrochemistry, to undesired impacts such as (iv) souring in oil reservoirs and other environments, and (v) corrosion of iron and concrete. Here we review recent advances in our understanding of SRPs focusing mainly on works published after 2000. The wealth of publications in this period, covering many diverse areas, is a testimony to the large environmental, biogeochemical and technological relevance of these organisms and how much the field has progressed in these years, although many important questions and applications remain to be explored.
9
Webb-Robertson BJM, Wiberg HK, Matzke MM, Brown JN, Wang J, McDermott JE, Smith RD, Rodland KD, Metz TO, Pounds JG, Waters KM. Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics. J Proteome Res 2015; 14:1993-2001. [PMID: 25855118 DOI: 10.1021/pr501138h] [Citation(s) in RCA: 167] [Impact Index Per Article: 18.6] [Indexed: 12/14/2022]
Abstract
In this review, we apply selected imputation strategies to label-free liquid chromatography-mass spectrometry (LC-MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC-MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. On the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
Affiliation(s)
- Holli K Wiberg
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Melissa M Matzke
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Joseph N Brown
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Jing Wang
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Jason E McDermott
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Richard D Smith
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Karin D Rodland
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Thomas O Metz
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Joel G Pounds
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
- Katrina M Waters
  - Pacific Northwest National Laboratory, PO BOX 999, K7-20, Richland, Washington 99352, United States
10
Haider S, Pal R. Integrated analysis of transcriptomic and proteomic data. Curr Genomics 2013; 14:91-110. [PMID: 24082820 PMCID: PMC3637682 DOI: 10.2174/1389202911314020003] [Citation(s) in RCA: 273] [Impact Index Per Article: 24.8] [Received: 09/10/2012] [Revised: 01/09/2013] [Accepted: 01/22/2013] [Indexed: 12/14/2022] Open
Abstract
Until recently, understanding the regulatory behavior of cells has been pursued through independent analysis of the transcriptome or the proteome. Based on the central dogma, it was generally assumed that there exists a direct correspondence between mRNA transcripts and generated protein expressions. However, recent studies have shown that the correlation between mRNA and protein expressions can be low due to various factors such as different half-lives and post-transcription machinery. Thus, a joint analysis of the transcriptomic and proteomic data can provide useful insights that may not be deciphered from individual analysis of mRNA or protein expressions. This article reviews the existing major approaches for joint analysis of transcriptomic and proteomic data. We categorize the different approaches into eight main categories based on the initial algorithm and final analysis goal. We further present analogies with other domains and discuss the existing research problems in this area.
Affiliation(s)
- Ranadip Pal
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409, USA