1
Wu W, Huang Z, Kong W, Peng H, Goh WWB. Optimizing the PROTREC network-based missing protein prediction algorithm. Proteomics 2024; 24:e2200332. [PMID: 37876146 DOI: 10.1002/pmic.202200332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2022] [Revised: 09/30/2023] [Accepted: 10/06/2023] [Indexed: 10/26/2023]
Abstract
This article summarizes the PROTREC method and investigates how its hyper-parameters affect missing protein prediction. We evaluate missing protein recovery rates under different PROTREC score selection approaches (MAX, MIN, MEDIAN, and MEAN), different PROTREC score thresholds, and different complex size thresholds. We also include two additional cancer datasets and introduce a new validation method to check both the robustness of PROTREC and the correctness of our analysis. Our analysis shows that the missing protein recovery rate can be improved by adopting the MIN, MEDIAN, or MEAN score selection operation instead of the default MAX. However, this may come at the cost of fewer proteins being predicted and validated, so users should choose their hyper-parameters carefully to balance the accuracy-quantity trade-off. We also explored combining PROTREC with a p-value-based method (FCS) and demonstrated that PROTREC performs well on its own, without help from a p-value-based method. Furthermore, we conducted a downstream enrichment analysis of the recovered proteins to understand the biological pathways and protein networks within the cancerous tissues.
- The missing protein recovery rate of PROTREC can be improved by selecting a different PROTREC score selection method.
- Different PROTREC score selection methods and other hyper-parameters, such as the PROTREC score threshold and the complex size threshold, introduce an accuracy-quantity trade-off.
- PROTREC performs well independently of any filtering by a p-value-based method.
- The PROTREC method is verified on additional cancer datasets.
- Downstream enrichment analysis is used to understand the biological pathways and protein networks in cancerous tissues.
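The three hyper-parameters named in this abstract (score selection operation, score threshold, complex size threshold) can be illustrated with a minimal sketch. It assumes that each protein complex already carries a score and that a missing protein inherits a score from the complexes it belongs to; the abstract does not say how complex scores are computed, so they are plain inputs here. The function name, parameter names, and toy data are hypothetical and are not taken from the PROTREC implementation.

```python
from statistics import mean, median

# Hypothetical mapping from the operation names in the abstract to Python callables.
SELECTORS = {"MAX": max, "MIN": min, "MEDIAN": median, "MEAN": mean}


def select_protein_scores(complexes, missing, how="MAX",
                          score_threshold=0.0, min_complex_size=1):
    """Aggregate per-complex scores into one score per missing protein.

    complexes: dict mapping complex name -> (set of member proteins, complex score)
    missing:   iterable of proteins that were not observed in the screen
    Returns a dict protein -> aggregated score for proteins that pass the threshold.
    """
    pick = SELECTORS[how]
    selected = {}
    for protein in missing:
        # Collect scores of all sufficiently large complexes containing the protein.
        scores = [score for members, score in complexes.values()
                  if protein in members and len(members) >= min_complex_size]
        if not scores:
            continue  # protein is not covered by any eligible complex
        value = pick(scores)
        if value >= score_threshold:
            selected[protein] = value
    return selected


# Toy data (assumed, for illustration only).
complexes = {
    "C1": ({"A", "B", "X"}, 0.9),
    "C2": ({"X", "C", "D", "E"}, 0.5),
    "C3": ({"Y", "F"}, 0.7),
}
missing = {"X", "Y"}

by_max = select_protein_scores(complexes, missing, how="MAX", score_threshold=0.6)
by_min = select_protein_scores(complexes, missing, how="MIN", score_threshold=0.6)
```

In this toy example the MIN operation predicts fewer proteins than MAX at the same threshold, because a protein is kept only if every complex it belongs to scores highly. That is one way to see the accuracy-quantity trade-off the abstract describes.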
Affiliation(s)
- Wenshan Wu
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Zelu Huang
  - School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, Singapore, Singapore
- Weijia Kong
  - Department of Computer Science, National University of Singapore, Singapore, Singapore
  - School of Biological Science, Nanyang Technological University, Singapore, Singapore
- Hui Peng
  - School of Biological Science, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
  - School of Biological Science, Nanyang Technological University, Singapore, Singapore
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
2
What can scatterplots teach us about doing data science better? International Journal of Data Science and Analytics 2022. [DOI: 10.1007/s41060-022-00362-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
3
Kong W, Wong BJH, Gao H, Guo T, Liu X, Du X, Wong L, Goh WWB. PROTREC: A probability-based approach for recovering missing proteins based on biological networks. J Proteomics 2022; 250:104392. [PMID: 34626823 DOI: 10.1016/j.jprot.2021.104392] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/13/2021] [Revised: 08/30/2021] [Accepted: 09/02/2021] [Indexed: 12/18/2022]
Abstract
A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods, such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA), across a variety of proteomics datasets derived from different proteomics data acquisition paradigms: higher PROTREC scores are much more closely correlated with higher recovery rates of MPs across sample replicates. The PROTREC score, unlike methods reporting p-values, can be directly interpreted as the probability that an unreported protein in a proteomic screen is actually present in the sample being screened.
SIGNIFICANCE: Mass spectrometry (MS) has developed rapidly in recent years; however, a considerable proportion of proteins still goes undetected, leading to the missing protein problem. A few existing protein recovery methods are based on biological networks, but their performance is not satisfactory. We propose a new protein recovery method, PROTREC, a Bayesian-inspired approach based on biological networks, which shows exceptional performance across multiple validation strategies. It does not rely on peptide information, so it avoids the ambiguity issue that most protein assembly methods face.
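The link between score and recovery rate across replicates can be made concrete with a small sketch. It assumes a simple definition of recovery rate: the fraction of proteins predicted as missing in one sample that are actually observed in a replicate. The abstract does not give the exact validation protocol, so this definition, the function names, and the toy data are illustrative assumptions only.

```python
def recovery_rate(predicted, observed_in_replicate):
    """Fraction of predicted missing proteins that a replicate actually observed."""
    predicted = set(predicted)
    if not predicted:
        return 0.0  # nothing was predicted, so nothing can be recovered
    return len(predicted & set(observed_in_replicate)) / len(predicted)


def recovery_by_threshold(scores, observed_in_replicate, thresholds):
    """For each score threshold, report (number predicted, recovery rate).

    scores: dict mapping protein -> score in [0, 1] (assumed to be given)
    """
    report = {}
    for t in thresholds:
        predicted = {p for p, s in scores.items() if s >= t}
        report[t] = (len(predicted), recovery_rate(predicted, observed_in_replicate))
    return report


# Toy data (assumed, for illustration only).
scores = {"P1": 0.95, "P2": 0.8, "P3": 0.6, "P4": 0.3}
replicate = {"P1", "P2", "P4"}
report = recovery_by_threshold(scores, replicate, thresholds=[0.5, 0.9])
```

If scores behave like probabilities, raising the threshold should raise the recovery rate while shrinking the number of predictions, which is the pattern this toy report shows.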
Affiliation(s)
- Weijia Kong
  - School of Biological Sciences, Nanyang Technological University, Singapore
  - Department of Computer Science, National University of Singapore, Singapore
- Huanhuan Gao
  - Zhejiang Provincial Laboratory of Life Sciences and Biomedicine, Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, Zhejiang, China
  - Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Zhejiang Province, China
- Tiannan Guo
  - Zhejiang Provincial Laboratory of Life Sciences and Biomedicine, Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, Zhejiang, China
  - Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Zhejiang Province, China
- Xianming Liu
  - Bruker (Beijing) Scientific Technology Co., Ltd, Shanghai, China
- Xiaoxian Du
  - Bruker (Beijing) Scientific Technology Co., Ltd, Shanghai, China
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, Singapore
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
4
Aladeokin AC, Akiyama T, Kimura A, Kimura Y, Takahashi-Jitsuki A, Nakamura H, Makihara H, Masukawa D, Nakabayashi J, Hirano H, Nakamura F, Saito T, Saido T, Goshima Y. Network-guided analysis of hippocampal proteome identifies novel proteins that colocalize with Aβ in a mice model of early-stage Alzheimer’s disease. Neurobiol Dis 2019; 132:104603. [DOI: 10.1016/j.nbd.2019.104603] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 01/18/2019] [Revised: 07/12/2019] [Accepted: 09/02/2019] [Indexed: 12/14/2022] Open
5
Proteomic investigation of intra-tumor heterogeneity using network-based contextualization - A case study on prostate cancer. J Proteomics 2019; 206:103446. [PMID: 31323421 DOI: 10.1016/j.jprot.2019.103446] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/17/2019] [Revised: 06/12/2019] [Accepted: 07/08/2019] [Indexed: 12/26/2022]
Abstract
Cancer is a heterogeneous disease, confounding the identification of relevant markers and drug targets. Network-based analysis is robust against noise, potentially offering a promising approach towards biomarker identification. We describe here the application of two network-based methods, qPSP (Quantitative Proteomics Signature Profiling) and PFSNet (Paired Fuzzy SubNetworks), in an intra-tissue proteome data set of prostate tissue samples. Despite high basal variation, we find that traditional statistical analysis may exaggerate the extent of heterogeneity. We also report that network-based analysis outperforms protein-based feature selection with concomitantly higher cross-validation accuracy. Overall, network-based analysis provides emergent signal that boosts sensitivity while retaining good precision. It is a potential means of circumventing heterogeneity for stable biomarker discovery.
6
Zhao Y, Sue ACH, Goh WWB. Deeper investigation into the utility of functional class scoring in missing protein prediction from proteomics data. J Bioinform Comput Biol 2019; 17:1950013. [DOI: 10.1142/s0219720019500136] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from a newer proteomics technology (SWATH) and check reproducibility using two independent datasets profiling the kidney tissue proteome. We also evaluate the objectivity of the FCS p-value and follow up on the value of MPP from predicted complexes. Our results suggest that (1) FCS p-values are non-objective and are strongly confounded by complex size; (2) the best recovery performance does not necessarily lie at standard p-value cutoffs; (3) while predicted complexes may be used to augment MPP, they are inferior to real complexes and are further confounded by issues relating to network coverage and quality; and (4) moderate-sized complexes of size 5 to 10 still exhibit considerable instability, and FCS works best with big complexes. While FCS is a powerful approach, blind reliance on its non-objective p-value is ill-advised.
Affiliation(s)
- Yaxing Zhao
  - School of Pharmaceutical Science and Technology, Tianjin University, No. 92, Weijin Road, 30072 Tianjin, P. R. China
- Andrew Chi-Hau Sue
  - School of Pharmaceutical Science and Technology, Tianjin University, No. 92, Weijin Road, 30072 Tianjin, P. R. China
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
7
Zhou L, Wong L, Goh WWB. Understanding missing proteins: a functional perspective. Drug Discov Today 2018; 23:644-651. [DOI: 10.1016/j.drudis.2017.11.011] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 06/07/2017] [Revised: 10/24/2017] [Accepted: 11/13/2017] [Indexed: 01/03/2023]
8
Begum T, Ghosh TC, Basak S. Systematic Analyses and Prediction of Human Drug Side Effect Associated Proteins from the Perspective of Protein Evolution. Genome Biol Evol 2017; 9:337-350. [PMID: 28391292 PMCID: PMC5499873 DOI: 10.1093/gbe/evw301] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Accepted: 02/16/2017] [Indexed: 12/20/2022] Open
Abstract
Identifying the factors involved in adverse drug reactions in target proteins is very important for developing therapeutic drugs with minimal or no side effects. In this context, we performed comparative evolutionary rate analyses between genes exhibiting drug side effect(s) (SET) and genes showing no side effect (NSET), with the aim of increasing the prediction accuracy of SET/NSET proteins using evolutionary rate determinants. We found that SET proteins are more conserved than NSET proteins. The rates of evolution of SET and NSET proteins depend primarily on their noncomplex-forming nature (protein complex association number = 0), phylogenetic age, multifunctionality, membrane localization, and transmembrane helix content, irrespective of their essentiality, total druggability (total number of drugs/target), mRNA expression level, and tissue expression breadth. We also introduced two novel terms, killer druggability (number of drugs with killing side effect(s)/target) and essential druggability (number of drugs targeting essential proteins/target), to explain the evolutionary rate variation between SET and NSET proteins. Interestingly, we noticed that SET proteins are younger than NSET proteins, and that multifunctional younger SET proteins are candidates for acquiring killing side effects. We provide evidence that higher killer druggability, multifunctionality, and transmembrane helices support the conservation of SET proteins over NSET proteins in spite of their recent origin. By employing all these entities, our Support Vector Machine model predicts human SET/NSET proteins to a high degree of accuracy (∼86%).
Affiliation(s)
- Tina Begum
  - Bioinformatics Centre, Tripura University, Suryamaninagar, Tripura, India
- Surajit Basak
  - Bioinformatics Centre, Tripura University, Suryamaninagar, Tripura, India
  - Department of Molecular Biology & Bioinformatics, Tripura University, Suryamaninagar, Tripura, India
9
Abstract
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms, but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R package NetProt, which provides implementations of representative feature-selection methods. NetProt also provides methods for generating simulated differential data and generating pseudocomplexes for complex-based performance benchmarking. The NetProt open source R package is available for download from https://github.com/gohwils/NetProt/releases/, and online documentation is available at http://rpubs.com/gohwils/204259.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
  - Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 119074
10
Goh WWB, Wong L. Class-paired Fuzzy SubNETs: A paired variant of the rank-based network analysis family for feature selection based on protein complexes. Proteomics 2017; 17:e1700093. [PMID: 28390171 DOI: 10.1002/pmic.201700093] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/09/2017] [Accepted: 04/05/2017] [Indexed: 01/12/2023]
Abstract
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are similar in the sense that they deploy rank-defined weights among proteins per sample. This procedure is known as gene fuzzy scoring. Currently, no RBNA exists for paired-sample scenarios where both control and test tissues originate from the same source (e.g. same patient). It is expected that paired tests, when used appropriately, are more powerful than approaches intended for unpaired samples. We report that the class-paired RBNA, PPFSNET, dominates in both simulated and real data scenarios. Moreover, for the first time, we explicitly incorporate batch-effect resistance as an additional evaluation criterion for feature-selection approaches. Batch effects are class irrelevant variations arising from different handlers or processing times, and can obfuscate analysis. We demonstrate that PPFSNET and an earlier RBNA, PFSNET, are particularly resistant against batch effects, and only select features strongly correlated with class but not batch.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, P. R. China
  - Department of Computer Science, National University of Singapore, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore
  - Department of Pathology, National University of Singapore, Singapore
11
Goh WWB, Wong L. Protein complex-based analysis is resistant to the obfuscating consequences of batch effects --- a case study in clinical proteomics. BMC Genomics 2017; 18:142. [PMID: 28361693 PMCID: PMC5374662 DOI: 10.1186/s12864-017-3490-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/06/2023] Open
Abstract
Background: In proteomics, batch effects are technical sources of variation that confound proper analysis, preventing effective deployment in clinical and translational research.
Results: Using simulated and real data, we demonstrate that existing batch effect-correction methods do not always eradicate all batch effects. Worse still, they may alter data integrity and introduce false positives. Moreover, although principal component analysis (PCA) is commonly used for detecting batch effects, the principal components (PCs) themselves may be used as differential features, from which relevant differential proteins may be effectively traced. Batch effects are removable by identifying PCs highly correlated with batch but not class effect. However, neither PC-based nor existing batch effect-correction methods deal well with subtle batch effects, which are difficult to eradicate, and both involve data transformation and/or projection, which is error-prone. To address this, we introduce the concept of batch-effect resistant methods and demonstrate how such methods incorporating protein complexes are particularly resistant to batch effects without compromising data integrity.
Conclusions: Protein complex-based analyses are powerful, offering unparalleled differential protein-selection reproducibility and high prediction accuracy. We demonstrate for the first time their innate resistance against batch effects, even subtle ones. As complex-based analyses require no prior data transformation (e.g. batch-effect correction), data integrity is protected. Individual checks on top-ranked protein complexes confirm strong association with phenotype classes and not batch. Therefore, the constituent proteins of these complexes are more likely to be clinically relevant.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin, 300072, People's Republic of China
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
  - Department of Pathology, National University of Singapore, Singapore, Singapore
12
Lee PKM, Goh WWB, Sng JCG. Network-based characterization of the synaptic proteome reveals that removal of epigenetic regulator Prmt8 restricts proteins associated with synaptic maturation. J Neurochem 2017; 140:613-628. [PMID: 27935040 DOI: 10.1111/jnc.13921] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 10/08/2016] [Revised: 11/30/2016] [Accepted: 12/04/2016] [Indexed: 12/13/2022]
Abstract
The brain adapts to dynamic environmental conditions by altering its epigenetic state, thereby influencing neuronal transcriptional programs. An example of an epigenetic modification is protein methylation, catalyzed by protein arginine methyltransferases (PRMT). One member, Prmt8, is selectively expressed in the central nervous system during a crucial phase of early development, but little else is known regarding its function. We hypothesize Prmt8 plays a role in synaptic maturation during development. To evaluate this, we used a proteome-wide approach to characterize the synaptic proteome of Prmt8 knockout versus wild-type mice. Through comparative network-based analyses, proteins and functional clusters related to neurite development were identified to be differentially regulated between the two genotypes. One interesting protein that was differentially regulated was tenascin-R (TNR). Chromatin immunoprecipitation demonstrated binding of PRMT8 to the tenascin-r (Tnr) promoter. TNR, a component of perineuronal nets, preserves structural integrity of synaptic connections within neuronal networks during the development of visual-somatosensory cortices. On closer inspection, Prmt8 removal increased net formation and decreased inhibitory parvalbumin-positive (PV+) puncta on pyramidal neurons, thereby hindering the maturation of circuits. Consequently, visual acuity of the knockout mice was reduced. Our results demonstrated Prmt8's involvement in synaptic maturation and its prospect as an epigenetic modulator of developmental neuroplasticity by regulating structural elements such as the perineuronal nets.
Affiliation(s)
- Patrick Kia Ming Lee
  - Integrative Neuroscience Program, Singapore Institute for Clinical Sciences, Agency for Science Technology and Research (A*STAR), Singapore
  - Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, China
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, China
  - Department of Computer Science, National University of Singapore, Singapore
- Judy Chia Ghee Sng
  - Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
13
Abstract
Background: Gene expression data produced on high-throughput platforms such as microarrays are susceptible to much variation that obscures useful biological information. Therefore, preprocessing data with a suitable normalization method is necessary, and has a direct and massive impact on the quality of downstream data analysis. However, it is known that standard normalization methods perform poorly, especially in the presence of substantial batch effects and heterogeneity in gene expression data.
Results: We present Gene Fuzzy Score (GFS), a simple preprocessing technique that is able to largely reduce obscuring variation while retaining useful biological information. Using four sets of publicly available datasets containing batch effects and heterogeneity, we compare GFS with three standard normalization techniques as well as raw gene expression. Each method is evaluated with respect to the quality, consistency, and biological coherence of its processed output. We find that GFS outperforms the other transformation techniques in all three aspects.
Conclusion: Our approach to preprocessing is a stronger alternative to popular normalization techniques. We demonstrate that it achieves the essential goal of preprocessing: it is effective at making expression values from multiple samples comparable, even when they are from separate platforms, in independent batches, or belong to a heterogeneous phenotype.
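A rank-based fuzzy score of the kind this abstract describes can be sketched as follows. The abstract does not state the scoring rule, so this sketch assumes a common formulation: within one sample, genes in the top `theta1` fraction by expression get a score of 1, genes outside the top `theta2` fraction get 0, and genes in between are linearly interpolated. The function name and the default values 0.05 and 0.15 are assumptions for illustration, not values confirmed by the text above.

```python
def gene_fuzzy_scores(expression, theta1=0.05, theta2=0.15):
    """Convert one sample's expression values into rank-based scores in [0, 1].

    expression: dict mapping gene -> expression value for a single sample
    theta1, theta2: assumed rank-fraction cutoffs (theta1 < theta2)
    """
    if not 0 < theta1 < theta2 <= 1:
        raise ValueError("require 0 < theta1 < theta2 <= 1")
    n = len(expression)
    # Rank genes from highest to lowest expression (ties keep input order).
    ranked = sorted(expression, key=expression.get, reverse=True)
    scores = {}
    for i, gene in enumerate(ranked):
        q = (i + 1) / n  # rank fraction: small q means highly expressed
        if q <= theta1:
            scores[gene] = 1.0
        elif q > theta2:
            scores[gene] = 0.0
        else:
            # Linear ramp from 1 at theta1 down to 0 at theta2.
            scores[gene] = (theta2 - q) / (theta2 - theta1)
    return scores


# Toy sample with 20 genes; g1 has the highest expression (assumed data).
sample = {f"g{i}": float(21 - i) for i in range(1, 21)}
fuzzy = gene_fuzzy_scores(sample)
```

Because the score depends only on within-sample ranks, two samples measured on different scales but with the same gene ordering receive identical scores. This is the intuition behind making values comparable across platforms and batches.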
Affiliation(s)
- Abha Belorkar
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore, 117417, Republic of Singapore
- Limsoon Wong
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore, 117417, Republic of Singapore
14
Wang W, Sue ACH, Goh WWB. Feature selection in clinical proteomics: with great power comes great reproducibility. Drug Discov Today 2016; 22:912-918. [PMID: 27988358 DOI: 10.1016/j.drudis.2016.12.006] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 10/12/2016] [Revised: 11/27/2016] [Accepted: 12/08/2016] [Indexed: 01/17/2023]
Abstract
In clinical proteomics, reproducible feature selection is unattainable given the standard statistical hypothesis-testing framework. This leads to irreproducible signatures with no diagnostic power. Instability stems from high P-value variability (p_var), which is inevitable and insolvable. The impact of p_var can be reduced by increasing power, for example by increasing sample size and measurement accuracy. However, these are not realistic solutions in practice. Instead, workarounds using existing data, such as signal-boosting transformation techniques and network-based statistical testing, are more practical. Furthermore, it is useful to consider other metrics alongside P-values, including confidence intervals, effect sizes, and cross-validation accuracies, to make informed inferences.
Affiliation(s)
- Wei Wang
  - School of Pharmaceutical Science and Technology, Tianjin University, China
- Andrew C-H Sue
  - School of Pharmaceutical Science and Technology, Tianjin University, China
- Wilson W B Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, China
  - Department of Bioengineering, Tianjin University, China
  - Department of Computer Science, National University of Singapore, Singapore
15
Goh WWB. Fuzzy-FishNET: a highly reproducible protein complex-based approach for feature selection in comparative proteomics. BMC Med Genomics 2016; 9:67. [PMID: 28117654 PMCID: PMC5260792 DOI: 10.1186/s12920-016-0228-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/20/2023] Open
Abstract
Background: The hypergeometric enrichment analysis approach typically fares poorly in feature-selection stability due to its upstream reliance on the t-test to generate differential protein lists before testing for enrichment on a protein complex, subnetwork, or gene group.
Methods: Swapping the t-test for a fuzzy rank-based weight system, similar to that used in network-based methods like Quantitative Proteomics Signature Profiling (QPSP), Fuzzy SubNets (FSNET), and paired FSNET (PFSNET), produces dramatic improvements.
Results: This approach, Fuzzy-FishNET, exhibits high precision-recall over three sets of simulated data (with simulated protein complexes) while excelling in feature-selection reproducibility on real data (based on evaluation with real protein complexes). Overlap comparisons with PFSNET show that Fuzzy-FishNET selects the most significant complexes, which are also strongly class-discriminative. Cross-validation further demonstrates that Fuzzy-FishNET selects class-relevant protein complexes.
Conclusions: Based on evaluation with simulated and real datasets, Fuzzy-FishNET is a significant upgrade of the traditional hypergeometric enrichment approach and a powerful new entrant amongst comparative proteomics analysis methods.
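For readers unfamiliar with the traditional hypergeometric enrichment test that this abstract takes as its baseline, the following sketch computes the standard upper-tail p-value for the overlap between a pre-selected protein list and one protein complex. It illustrates only the classical test, not Fuzzy-FishNET itself, whose weighting scheme is not specified in the abstract. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, complex_size, selected, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population:   total number of proteins measured
    complex_size: number of measured proteins that belong to the complex
    selected:     number of proteins in the pre-selected (differential) list
    overlap:      number of selected proteins that fall in the complex
    """
    upper = min(complex_size, selected)
    total = comb(population, selected)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(complex_size, k) * comb(population - complex_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (assumed): 10 proteins measured, a complex of 4,
# 3 proteins selected as differential, 2 of which are in the complex.
p_value = hypergeom_enrichment_p(population=10, complex_size=4, selected=3, overlap=2)
```

The p-value depends entirely on which proteins make it into the selected list, so an unstable upstream selection step (such as a t-test with a hard cutoff) propagates directly into the enrichment result. That is the weakness the abstract identifies.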
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, People's Republic of China
16
Goh WWB, Wong L. Integrating Networks and Proteomics: Moving Forward. Trends Biotechnol 2016; 34:951-959. [DOI: 10.1016/j.tibtech.2016.05.015] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 04/19/2016] [Revised: 05/23/2016] [Accepted: 05/24/2016] [Indexed: 11/28/2022]
17
Goh WWB, Wong L. Spectra-first feature analysis in clinical proteomics — A case study in renal cancer. J Bioinform Comput Biol 2016; 14:1644004. [DOI: 10.1142/s0219720016440042] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
In proteomics, useful signal may be unobserved or lost due to the lack of confident peptide-spectral matches. Selection of differential spectra, followed by associative peptide/protein mapping may be a complementary strategy for improving sensitivity and comprehensiveness of analysis (spectra-first paradigm). This approach is complementary to the standard approach where functional analysis is performed only on the finalized protein list assembled from identified peptides from the spectra (protein-first paradigm). Based on a case study of renal cancer, we introduce a simple spectra-binning approach, MZ-bin. We demonstrate that differential spectra feature selection using MZ-bin is class-discriminative and can trace relevant proteins via spectra associative mapping. Moreover, proteins identified in this manner are more biologically coherent than those selected directly from the finalized protein list. Analysis of constituent peptides per protein reveals high expression inconsistency, suggesting that the measured protein expressions are in fact, poor approximations of true protein levels. Moreover, analysis at the level of constituent peptides may provide higher resolution insight into the underlying biology: Via MZ-bin, we identified for the first time differential splice forms for the known renal cancer marker MAPT. We conclude that the spectra-first analysis paradigm is a complementary strategy to the traditional protein-first paradigm and can provide deeper level insight.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, P. R. China
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417 Singapore
18
Goh WWB, Wong L. Advancing Clinical Proteomics via Analysis Based on Biological Complexes: A Tale of Five Paradigms. J Proteome Res 2016; 15:3167-79. [DOI: 10.1021/acs.jproteome.6b00402] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 01/03/2023]
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
  - Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 117417
19
Goh WWB, Wong L. Evaluating feature-selection stability in next-generation proteomics. J Bioinform Comput Biol 2016; 14:1650029. [PMID: 27640811 DOI: 10.1142/s0219720016500293] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Indexed: 01/10/2023]
Abstract
Identifying reproducible yet relevant features is a major challenge in biological research. This is well documented in genomics data. Using a proposed set of three reliability benchmarks, we find that this issue also exists in proteomics for commonly used feature-selection methods, e.g. the t-test and recursive feature elimination. Moreover, due to high test variability, selecting the top proteins based on p-value ranks, even when restricted to high-abundance proteins, does not improve reproducibility. Statistical testing based on networks is believed to be more robust, but this does not always hold true: the commonly used hypergeometric enrichment that tests for enrichment of protein subnets performs abysmally due to its dependence on unstable protein pre-selection steps. We demonstrate here for the first time the utility on proteomics data of a novel suite of network-based algorithms called rank-based network algorithms (RBNAs). These were originally introduced and tested extensively on genomics data. We show here that they are highly stable and reproducible, and that they select relevant features when applied to proteomics data. It is also evident from these results that statistical feature testing on protein expression data should be used with due caution. Careless use of networks does not resolve poor-performance issues, and can even mislead. We recommend augmenting statistical feature-selection methods with concurrent analysis of stability and reproducibility to improve the quality of the selected features prior to experimental validation.
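One simple way to quantify the feature-selection stability discussed in this abstract is the average pairwise Jaccard overlap between the feature sets selected on different resamplings of the data. The abstract does not specify its three reliability benchmarks, so the sketch below is a generic stability measure offered for illustration and is not necessarily one of the benchmarks used in the paper. All names and the toy data are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(feature_sets):
    """Mean pairwise Jaccard similarity across repeated feature selections."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy example (assumed): features selected on three resamplings of one dataset.
runs = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "C"}]
stability = selection_stability(runs)
```

A value near 1 means the method keeps selecting the same features regardless of which samples are drawn, while a value near 0 indicates the kind of irreproducibility the abstract attributes to p-value-ranked protein lists.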
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417 Singapore
- Limsoon Wong
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417 Singapore
20
Design principles for clinical network-based proteomics. Drug Discov Today 2016; 21:1130-8. [DOI: 10.1016/j.drudis.2016.05.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 09/06/2015] [Revised: 04/18/2016] [Accepted: 05/20/2016] [Indexed: 01/10/2023]