1
Kabir MN, Wang LR, Goh WWB. Exploiting the similarity of dissimilarities for biomedical applications and enhanced machine learning. PLoS Comput Biol 2025; 21:e1012716. [PMID: 39854337 PMCID: PMC11759369 DOI: 10.1371/journal.pcbi.1012716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2025]
Abstract
The "similarity of dissimilarities" is an emerging paradigm in biomedical science with significant implications for protein function prediction, machine learning (ML), and personalized medicine. In protein function prediction, recognizing dissimilarities alongside similarities provides a more detailed understanding of evolutionary processes, allowing for a deeper exploration of regions that influence biological functionality. For ML models, incorporating dissimilarity measures helps avoid misleading results caused by highly correlated or similar data, addressing confounding issues like the Doppelgänger Effect. This leads to more accurate insights and a stronger understanding of complex biological systems. In the realm of personalized AI and precision medicine, the importance of dissimilarities is paramount. Personalized AI builds local models for each sample by identifying a network of neighboring samples. However, if the neighboring samples are too similar, it becomes difficult to identify factors critical to disease onset for the individual, limiting the effectiveness of personalized interventions or treatments. This paper discusses the "similarity of dissimilarities" concept, using protein function prediction, ML, and personalized AI as key examples. Integrating this approach into an analysis allows for the design of better, more meaningful experiments and the development of smarter validation methods, ensuring that the models learn in a meaningful way.
Affiliation(s)
- Mohammad Neamul Kabir
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
- Li Rong Wang
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Center of AI in Medicine, Nanyang Technological University, Singapore, Singapore
- Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
2
Nickerson JL, Gagnon H, Wentzell PD, Doucette AA. Assessing the precision of a detergent-assisted cartridge precipitation workflow for non-targeted quantitative proteomics. Proteomics 2024; 24:e2300339. [PMID: 38299459 DOI: 10.1002/pmic.202300339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 01/08/2024] [Accepted: 01/12/2024] [Indexed: 02/02/2024]
Abstract
Detergent-based workflows incorporating sodium dodecyl sulfate (SDS) necessitate additional steps for detergent removal ahead of mass spectrometry (MS). These steps may lead to variable protein recovery, inconsistent enzyme digestion efficiency, and unreliable MS signals. To validate a detergent-based workflow for quantitative proteomics, we herein evaluate the precision of a bottom-up sample preparation strategy incorporating cartridge-based protein precipitation with organic solvent to deplete SDS. The variance of data-independent acquisition (SWATH-MS) data was isolated from sample preparation error by modelling the variance as a function of peptide signal intensity. Our SDS-assisted cartridge workflow yields a coefficient of variation (CV) of 13%-14%. By comparison, conventional (detergent-free) in-solution digestion increased the CV to 50%; in-gel digestion provided lower CVs between 14% and 20%. By filtering peptides predicted to display lower precision, we further enhance the validity of data in global comparative proteomics. These results demonstrate that the detergent-based precipitation workflow is a reliable approach for in-depth, label-free quantitative proteome analysis.
Affiliation(s)
- Hugo Gagnon
- PhenoSwitch Bioscience Inc., Sherbrooke, Quebec, Canada
- Peter D Wentzell
- Department of Chemistry, Dalhousie University, Halifax, Nova Scotia, Canada
- Alan A Doucette
- Department of Chemistry, Dalhousie University, Halifax, Nova Scotia, Canada
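The precision figures quoted in the abstract above are coefficients of variation across replicate measurements. As a minimal sketch of that statistic only, assuming replicate peptide intensities are already available as plain lists, the CV could be computed as below. The peptide names and intensities are hypothetical, and the sketch does not reproduce the authors' modelling of variance as a function of signal intensity.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(intensities):
    """Return the CV (sample standard deviation / mean) as a percentage."""
    m = mean(intensities)
    if m == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return 100.0 * stdev(intensities) / m


def median_cv(peptide_table):
    """Median CV across peptides; each value is a list of replicate intensities."""
    return median(coefficient_of_variation(v) for v in peptide_table.values())


# Hypothetical replicate intensities for three peptides (not data from the paper).
replicates = {
    "PEPTIDE_A": [100.0, 110.0, 90.0],
    "PEPTIDE_B": [200.0, 200.0, 200.0],
    "PEPTIDE_C": [50.0, 75.0, 25.0],
}

per_peptide = {k: round(coefficient_of_variation(v), 1) for k, v in replicates.items()}
print(per_peptide)
print(round(median_cv(replicates), 1))
```

With these made-up values the per-peptide CVs are 10%, 0% and 50%, so the median CV is 10%.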
3
Bernardo L, Lomagno A, Mauri PL, Di Silvestre D. Integration of Omics Data and Network Models to Unveil Negative Aspects of SARS-CoV-2, from Pathogenic Mechanisms to Drug Repurposing. Biology 2023; 12:1196. [PMID: 37759595 PMCID: PMC10525644 DOI: 10.3390/biology12091196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 08/25/2023] [Accepted: 08/30/2023] [Indexed: 09/29/2023]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the COVID-19 health emergency, affecting and killing millions of people worldwide. Following SARS-CoV-2 infection, COVID-19 patients show a spectrum of symptoms ranging from asymptomatic to very severe manifestations. In particular, bronchial and pulmonary cells, involved at the initial stage, trigger a hyper-inflammation phase, damaging a wide range of organs, including the heart, brain, liver, intestine and kidney. Due to the urgent need for solutions to limit the virus' spread, most efforts were initially devoted to mapping outbreak trajectories and variant emergence, as well as to the rapid search for effective therapeutic strategies. Samples collected from hospitalized or dead COVID-19 patients from the early stages of the pandemic have been analyzed over time, and to date they still represent an invaluable source of information to shed light on the molecular mechanisms underlying the organ/tissue damage, the knowledge of which could offer new opportunities for diagnostics and therapeutic designs. For these purposes, in combination with clinical data, omics profiles and network models play a key role in providing a holistic view of the pathways, processes and functions most affected by viral infection. In fact, in addition to epidemiological purposes, networks are being increasingly adopted for the integration of multiomics data, and recently their use has expanded to the identification of drug targets or the repositioning of existing drugs. These topics will be covered here by exploring the landscape of SARS-CoV-2 survey-based studies using systems biology approaches derived from omics data, paying particular attention to those that have considered samples of human origin.
Affiliation(s)
- Dario Di Silvestre
- Institute for Biomedical Technologies—National Research Council (ITB-CNR), 20054 Segrate, Italy; (L.B.); (A.L.); (P.L.M.)
4
Zhou Y, Zhang Y, Li F, Lian X, Zhu Q, Zhu F, Qiu Y. SISPRO: signature identification for spatial proteomics. J Mol Biol 2023. [DOI: 10.1016/j.jmb.2022.167944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023]
5
Li F, Zhou Y, Zhang Y, Yin J, Qiu Y, Gao J, Zhu F. POSREG: proteomic signature discovered by simultaneously optimizing its reproducibility and generalizability. Brief Bioinform 2022; 23:6532538. [PMID: 35183059 DOI: 10.1093/bib/bbac040] [Citation(s) in RCA: 91] [Impact Index Per Article: 30.3] [Received: 11/30/2021] [Revised: 01/21/2022] [Accepted: 01/27/2022] [Indexed: 12/17/2022]
Abstract
Mass spectrometry-based proteomics has become indispensable in the exploration of complex and dynamic biological processes. Instrument development has largely ensured the effective production of proteomic data, which necessitates commensurate advances in the statistical frameworks used to discover the optimal proteomic signature. Current frameworks mainly emphasize the generalizability of the identified signature in predicting independent data but neglect the reproducibility among signatures identified from independently repeated trials on different sub-datasets. These problems have seriously restricted the wide application of proteomics in molecular biology and related fields. Thus, it is crucial to enable the generalizable and reproducible discovery of proteomic signatures, with subsequent indication of phenotype association. However, no such tool has yet been made available. Herein, an online tool, POSREG, was constructed to identify the optimal signature for a set of proteomic data. It works by (i) identifying proteomic signatures of good reproducibility and aggregating them into an ensemble feature ranking by ensemble learning, (ii) assessing the generalizability of the ensemble feature ranking to acquire the optimal signature and (iii) indicating the phenotype association of the discovered signature. POSREG is unique in its capacity to discover the proteomic signature by simultaneously optimizing its reproducibility and generalizability. It is now accessible free of charge without any registration or login requirement at https://idrblab.org/posreg/.
Affiliation(s)
- Fengcheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Ying Zhou
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China
- Ying Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yunqing Qiu
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China
- Jianqing Gao
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
6
Fu J, Luo Y, Mou M, Zhang H, Tang J, Wang Y, Zhu F. Advances in Current Diabetes Proteomics: From the Perspectives of Label-free Quantification and Biomarker Selection. Curr Drug Targets 2021; 21:34-54. [PMID: 31433754 DOI: 10.2174/1389450120666190821160207] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/06/2019] [Revised: 07/17/2019] [Accepted: 07/24/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND Due to its prevalence and negative impacts on both the economy and society, diabetes mellitus (DM) has emerged as a worldwide concern. In light of this, label-free quantification (LFQ) proteomics and diabetic marker selection methods have been applied to elucidate the underlying mechanisms associated with insulin resistance, explore novel protein biomarkers, and discover innovative therapeutic protein targets. OBJECTIVE The purpose of this manuscript is to review and analyze the recent computational advances and development of label-free quantification and diabetic marker selection in diabetes proteomics. METHODS The Web of Science database, the PubMed database and Google Scholar were searched for label-free quantification, computational advances, feature selection and diabetes proteomics. RESULTS In this study, we systematically review the computational advances of label-free quantification and diabetic marker selection methods that have been applied to understand DM pathological mechanisms. Firstly, different popular quantification measurements and proteomic quantification software tools that have been applied to diabetes studies are comprehensively discussed. Secondly, a number of popular manipulation methods including transformation, pretreatment (centering, scaling, and normalization), missing value imputation methods and a variety of popular feature selection techniques applied to diabetes proteomic data are overviewed, with an objective evaluation of their advantages and disadvantages. Finally, guidelines for the efficient use of computation-based LFQ technology and feature selection methods in diabetes proteomics are proposed. CONCLUSION In summary, this review provides guidelines for researchers who will engage in proteomics biomarker discovery; by properly applying these proteomic computational advances, more reliable therapeutic targets will be found in the field of diabetes mellitus.
Affiliation(s)
- Jianbo Fu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Luo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Hongning Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China
- Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China
7
Khan MJ, Desaire H, Lopez OL, Kamboh MI, Robinson RA. Why Inclusion Matters for Alzheimer's Disease Biomarker Discovery in Plasma. J Alzheimers Dis 2021; 79:1327-1344. [PMID: 33427747 PMCID: PMC9126484 DOI: 10.3233/jad-201318] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/18/2023]
Abstract
BACKGROUND African American/Black adults have a disproportionate incidence of Alzheimer's disease (AD) and are underrepresented in biomarker discovery efforts. OBJECTIVE This study aimed to identify potential diagnostic biomarkers for AD using a combination of proteomics and machine learning approaches in a cohort that included African American/Black adults. METHODS We conducted a discovery-based plasma proteomics study on plasma samples (N = 113) obtained from clinically diagnosed AD and cognitively normal adults who self-reported as African American/Black or non-Hispanic White. Sets of differentially-expressed proteins were then classified using a support vector machine (SVM) to identify biomarker candidates. RESULTS In total, 740 proteins were identified, of which 25 differentially-expressed proteins in AD came from comparisons within a single racial and ethnic background group. Six proteins were differentially-expressed in AD regardless of racial and ethnic background. Supervised classification by SVM yielded an area under the curve (AUC) of 0.91 and accuracy of 86% for differentiating AD in samples from non-Hispanic White adults when trained with differentially-expressed proteins unique to that group. However, the same model yielded an AUC of 0.49 and accuracy of 47% for differentiating AD in samples from African American/Black adults. Other covariates such as age, APOE4 status, sex, and years of education were found to improve the model mostly in the samples from non-Hispanic White adults for classifying AD. CONCLUSION These results demonstrate the importance of study designs in AD biomarker discovery, which must include diverse racial and ethnic groups such as African American/Black adults to develop effective biomarkers.
Affiliation(s)
- Mostafa J. Khan
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Heather Desaire
- Department of Chemistry, University of Kansas, Lawrence, KS, USA
- Oscar L. Lopez
- Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA, USA
- M. Ilyas Kamboh
- Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Epidemiology, University of Pittsburgh, Pittsburgh, PA, USA
- Renã A.S. Robinson
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Memory and Alzheimer’s Center, Vanderbilt University Medical Center, Nashville, TN, USA
- Vanderbilt Institute of Chemical Biology, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Brain Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Neurology, Vanderbilt University Medical Center, Nashville, TN, USA
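The contrast between an AUC of 0.91 and an AUC of 0.49 in the abstract above is easier to read with the definition in hand: AUC is the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one, so a value near 0.5 means the scores carry essentially no ranking information. The sketch below computes AUC directly from that definition. It is an illustration only: the function name and the scores are hypothetical and are not taken from the study, which used an SVM on plasma proteomics data.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores (not data from the paper).
# Cohort A: cases score consistently higher than controls.
auc_a = roc_auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.6])
# Cohort B: the same scores appear in both classes, so ranking is uninformative.
auc_b = roc_auc([0.6, 0.4], [0.6, 0.4])

print(auc_a, auc_b)
```

In this toy case the first cohort gives an AUC of 1.0 and the second gives 0.5, the chance level that the reported 0.49 sits next to.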
8
Gakii C, Rimiru R. Identification of cancer related genes using feature selection and association rule mining. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100595] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/27/2022]
9
Fernández-Costa C, Martínez-Bartolomé S, McClatchy DB, Saviola AJ, Yu NK, Yates JR. Impact of the Identification Strategy on the Reproducibility of the DDA and DIA Results. J Proteome Res 2020; 19:3153-3161. [PMID: 32510229 PMCID: PMC7898222 DOI: 10.1021/acs.jproteome.0c00153] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Indexed: 02/06/2023]
Abstract
Data-independent acquisition (DIA) is a promising technique for the proteomic analysis of complex protein samples. A number of studies have claimed that DIA experiments are more reproducible than data-dependent acquisition (DDA), but these claims are unsubstantiated since different data analysis methods are used for the two acquisition methods. Data analysis in most DIA workflows depends on spectral library searches, whereas DDA typically employs sequence database searches. In this study, we examined the reproducibility of the DIA and DDA results using both sequence database and spectral library search. The comparison was first performed using a cell lysate and then extended to an interactome study. Protein overlap among the technical replicates in both DDA and DIA experiments was 30% higher with library-based identifications than with sequence database identifications. The reproducibility of quantification was also improved with library search compared to database search, with the mean of the coefficient of variation decreasing more than 30% and a reduction in the number of missing values of more than 35%. Our results show that regardless of the acquisition method, higher identification and quantification reproducibility is observed when library search is used.
Affiliation(s)
- Carolina Fernández-Costa
- Departments of Molecular Medicine & Neurobiology, The Scripps Research Institute, La Jolla, CA, USA
- Daniel B. McClatchy
- Departments of Molecular Medicine & Neurobiology, The Scripps Research Institute, La Jolla, CA, USA
- Anthony J. Saviola
- Departments of Molecular Medicine & Neurobiology, The Scripps Research Institute, La Jolla, CA, USA
- Nam-Kyung Yu
- Departments of Molecular Medicine & Neurobiology, The Scripps Research Institute, La Jolla, CA, USA
- John R. Yates
- Departments of Molecular Medicine & Neurobiology, The Scripps Research Institute, La Jolla, CA, USA
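The abstract above reports protein overlap among technical replicates as its identification reproducibility measure. One plausible way to quantify such overlap, sketched here under the assumption that it is defined as the share of proteins found in every replicate among those found in any replicate, is shown below. The paper's exact definition is not given in the abstract, and the accession-style identifiers are hypothetical.

```python
def replicate_overlap(replicates):
    """Fraction of proteins seen in every replicate among those seen in any replicate."""
    sets = [set(r) for r in replicates]
    union = set().union(*sets)
    if not union:
        return 0.0  # nothing identified, so no overlap to report
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical identifications from three technical replicates (not data from the paper).
runs = [
    ["P1", "P2", "P3"],
    ["P2", "P3", "P4"],
    ["P2", "P3"],
]
print(replicate_overlap(runs))
```

Here two of the four distinct proteins appear in all three runs, giving an overlap of 0.5. Comparing this value between search strategies on the same raw files is the kind of contrast the study describes.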
10
Quantitative proteomics to study aging in rabbit liver. Mech Ageing Dev 2020; 187:111227. [PMID: 32126221 DOI: 10.1016/j.mad.2020.111227] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/29/2019] [Revised: 01/24/2020] [Accepted: 02/27/2020] [Indexed: 12/23/2022]
Abstract
Aging globally affects cellular and organismal metabolism across a range of mammalian species, including humans and rabbits. Rabbits (Oryctolagus cuniculus) are an attractive model system of aging due to their genetic similarity with humans and their short lifespans. This model can be used to understand metabolic changes in aging, especially in major organs such as the liver, where we detected pronounced variations in fat metabolism, mitochondrial dysfunction, and protein degradation. Such changes in the liver are consistent across several mammalian species; however, in rabbits the downstream effects of these changes have not yet been explored. We have applied proteomics to study changes in the liver proteins from young, middle, and old age rabbits using a multiplexing cPILOT strategy. This resulted in the identification of 2,586 liver proteins, among which 45 proteins had significant (p < 0.05) changes with aging. Seven proteins were differentially-expressed at all ages and include fatty acid binding protein, aldehyde dehydrogenase, enoyl-CoA hydratase, 3-hydroxyacyl CoA dehydrogenase, apolipoprotein C3, peroxisomal sarcosine oxidase, adhesion G-protein coupled receptor, and glutamate ionotropic receptor kainate. Insights into how alterations in metabolism affect protein expression in liver have been gained and demonstrate the utility of rabbit as a model of aging.
11
King CD, Robinson RAS. Evaluating Combined Precursor Isotopic Labeling and Isobaric Tagging Performance on Orbitraps To Study the Peripheral Proteome of Alzheimer's Disease. Anal Chem 2020; 92:2911-2916. [PMID: 31940168 PMCID: PMC7932850 DOI: 10.1021/acs.analchem.9b01974] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/17/2022]
Abstract
Combined precursor isotopic labeling and isobaric tagging (cPILOT) is an enhanced multiplexing strategy currently capable of analyzing up to 24 samples simultaneously. This capability is especially helpful when studying multiple tissues and biological replicates in models of disease, such as Alzheimer's disease (AD). Here, cPILOT was used to study proteomes from heart, liver, and brain tissues in a late-stage amyloid precursor protein/presenilin-1 (APP/PS-1) human transgenic double-knock-in mouse model of AD. The original global cPILOT assay developed on an Orbitrap Velos instrument was transitioned to an Orbitrap Fusion Lumos instrument. The advantages of faster scan rates, lower limits of detection, and synchronous precursor selection on the Fusion Lumos allow greater numbers of isobarically tagged peptides to be quantified in comparison to the Orbitrap Velos. Parameters such as LC gradient, m/z isolation window, dynamic exclusion, targeted mass analyses, and synchronous precursor scan were optimized, leading to >600 000 PSMs, corresponding to 6074 proteins. Overall, these studies provide information on system-wide changes in brain, heart, and liver proteins from a mouse model of AD.
Affiliation(s)
- Christina D King
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
- Renã A S Robinson
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
- Department of Neurology, Vanderbilt University Medical Center, Nashville, Tennessee 37232, United States
- Vanderbilt Memory & Alzheimer's Center, Vanderbilt University Medical Center, Nashville, Tennessee 37212, United States
- Vanderbilt Institute of Chemical Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, Tennessee 37232, United States
12
Goh WWB, Wong L. Advanced bioinformatics methods for practical applications in proteomics. Brief Bioinform 2019; 20:347-355. [PMID: 30657890 DOI: 10.1093/bib/bbx128] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 07/14/2017] [Indexed: 12/22/2022]
Abstract
Mass spectrometry (MS)-based proteomics has undergone rapid advancements in recent years, creating challenging problems for bioinformatics. We focus on four aspects where bioinformatics plays a crucial role (and proteomics is needed for clinical application): peptide-spectra matching (PSM) based on the new data-independent acquisition (DIA) paradigm, resolving missing proteins (MPs), dealing with biological and technical heterogeneity in data and statistical feature selection (SFS). DIA is a brute-force strategy that provides greater width and depth but, because it indiscriminately captures spectra such that signal from multiple peptides is mixed, getting good PSMs is difficult. We consider two strategies: simplification of DIA spectra to pseudo-data-dependent acquisition spectra or, alternatively, brute-force search of each DIA spectra against known reference libraries. The MP problem arises when proteins are never (or inconsistently) detected by MS. When observed in at least one sample, imputation methods can be used to guess the approximate protein expression level. If never observed at all, network/protein complex-based contextualization provides an independent prediction platform. Data heterogeneity is a difficult problem with two dimensions: technical (batch effects), which should be removed, and biological (including demography and disease subpopulations), which should be retained. Simple normalization is seldom sufficient, while batch effect-correction algorithms may create errors. Batch effect-resistant normalization methods are a viable alternative. Finally, SFS is vital for practical applications. While many methods exist, there is no best method, and both upstream (e.g. normalization) and downstream processing (e.g. multiple-testing correction) are performance confounders. We also discuss signal detection when class effects are weak.
13
Liang Y, Kelemen A, Kelemen A. Reproducibility of biomarker identifications from mass spectrometry proteomic data in cancer studies. Stat Appl Genet Mol Biol 2019; 18:sagmb-2018-0039. [PMID: 31077580 DOI: 10.1515/sagmb-2018-0039] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Reproducibility of disease signatures and clinical biomarkers in multi-omics disease analysis has been a key challenge due to a multitude of factors. The heterogeneity of the limited sample, various biological factors such as environmental confounders, and the inherent experimental and technical noise, compounded with the inadequacy of statistical tools, can lead to the misinterpretation of results, and subsequently very different biology. In this paper, we investigate the biomarker reproducibility issues potentially caused by differences between statistical methods with varied distribution assumptions or marker selection criteria, using mass spectrometry proteomic ovarian tumor data. We examine the relationship between effect sizes, p values, Cauchy p values, False Discovery Rate p values, and the rank fractions of identified proteins out of thousands in the limited heterogeneous sample. We compared the markers identified from statistical single-feature selection approaches with machine learning wrapper methods. The results reveal marked differences when selecting the protein markers from varied methods with potential selection biases and false discoveries, which may be due to the small effects, different distribution assumptions, and p value type criteria versus prediction accuracies. The alternative solutions and other related issues are discussed in supporting the reproducibility of findings for clinically actionable outcomes.
Affiliation(s)
- Yulan Liang
- Department of Family and Community Health, University of Maryland, Baltimore, MD 21201-1579, USA
- Adam Kelemen
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Arpad Kelemen
- Department of Organizational Systems and Adult Health, University of Maryland, Baltimore, MD 21201-1579, USA
14
Zhao Y, Sue ACH, Goh WWB. Deeper investigation into the utility of functional class scoring in missing protein prediction from proteomics data. J Bioinform Comput Biol 2019; 17:1950013. [DOI: 10.1142/s0219720019500136] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from new proteomics technology (SWATH) and also check for reproducibility using two independent datasets profiling the kidney tissue proteome. We also evaluate the objectivity of the FCS p-value, and follow up on the value of MPP from predicted complexes. Our results suggest that (1) FCS p-values are non-objective, and are confounded strongly by complex size, (2) best recovery performance does not necessarily lie at standard p-value cutoffs, (3) while predicted complexes may be used for augmenting MPP, they are inferior to real complexes, and are further confounded by issues relating to network coverage and quality and (4) moderate-sized complexes of size 5 to 10 still exhibit considerable instability; we find that FCS works best with big complexes. While FCS is a powerful approach, blind reliance on its non-objective p-value is ill-advised.
Affiliation(s)
- Yaxing Zhao
- School of Pharmaceutical Science and Technology, Tianjin University, No. 92, Weijin Road, 30072 Tianjin, P. R. China
- Andrew Chi-Hau Sue
- School of Pharmaceutical Science and Technology, Tianjin University, No. 92, Weijin Road, 30072 Tianjin, P. R. China
- Wilson Wen Bin Goh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
15
Lualdi M, Fasano M. Statistical analysis of proteomics data: A review on feature selection. J Proteomics 2019; 198:18-26. [DOI: 10.1016/j.jprot.2018.12.004] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 10/12/2018] [Revised: 11/27/2018] [Accepted: 12/05/2018] [Indexed: 12/19/2022]
16
Zhou LT, Lv LL, Liu BC. Urinary Biomarkers of Renal Fibrosis. Advances in Experimental Medicine and Biology 2019; 1165:607-623. [PMID: 31399987 DOI: 10.1007/978-981-13-8871-2_30] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/14/2022]
Abstract
Renal fibrosis is the common pathological pathway of progressive CKD. The commonly used biomarkers in clinical practice are not optimal to detect injury or predict prognosis. Therefore, it is crucial to develop novel biomarkers to allow prompt intervention. Urine serves as a valuable resource of biomarker discovery for kidney diseases. Owing to the rapid development of omics platforms and bioinformatics, research on novel urinary biomarkers for renal fibrosis has proliferated in recent years. In this chapter, we discuss the current status and provide basic knowledge in this field. We present novel promising biomarkers including tubular injury markers, proteins related to activated inflammation/fibrosis pathways, CKD273, transcriptomic biomarkers, as well as metabolomic biomarkers. Furthermore, considering the complex nature of the pathogenesis of renal fibrosis, we also highlight the combination of biomarkers to further improve the diagnostic and prognostic performance.
Affiliation(s)
- Le-Ting Zhou
- Institute of Nephrology, Zhong Da Hospital, Southeast University School of Medicine, DingJiaQiao Road, Nanjing, China
| | - Lin-Li Lv
- Institute of Nephrology, Zhong Da Hospital, Southeast University School of Medicine, DingJiaQiao Road, Nanjing, China
| | - Bi-Cheng Liu
- Institute of Nephrology, Zhong Da Hospital, Southeast University School of Medicine, DingJiaQiao Road, Nanjing, China.
|
17
|
Goh WWB, Wong L. Turning straw into gold: building robustness into gene signature inference. Drug Discov Today 2018; 24:31-36. [PMID: 30081096 DOI: 10.1016/j.drudis.2018.08.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2018] [Revised: 07/23/2018] [Accepted: 08/01/2018] [Indexed: 12/29/2022]
Abstract
Reproducible and generalizable gene signatures are essential for clinical deployment, but are hard to come by. The primary issue is insufficient mitigation of confounders: ensuring that hypotheses are appropriate, test statistics and null distributions are appropriate, and so on. To further improve robustness, additional good analytical practices (GAPs) are needed, namely: leveraging existing data and knowledge; careful and systematic evaluation of gene sets, even if they overlap with known sources of confounding; and rigorous testing of inferred signatures against as many published data sets as possible. Here, using a re-examination of a breast cancer data set and 48 published signatures, we illustrate the value of adopting these GAPs.
Affiliation(s)
- Wilson Wen Bin Goh
- School of Biological Sciences, Nanyang Technological University, Singapore.
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore; Department of Pathology, National University of Singapore, Singapore.
|
18
|
Why breast cancer signatures are no better than random signatures explained. Drug Discov Today 2018; 23:1818-1823. [PMID: 29864526 DOI: 10.1016/j.drudis.2018.05.036] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2018] [Revised: 05/14/2018] [Accepted: 05/29/2018] [Indexed: 12/30/2022]
Abstract
Random signature superiority (RSS) occurs when random gene signatures outperform published and/or known signatures. Unlike reproducibility and generalizability issues, RSS is relatively underexplored. Yet, understanding it is imperative for better analytical outcomes. In breast cancer, RSS correlates strongly with enrichment for proliferation genes and with signature size. Removal of proliferation genes from random signatures reduces their predictive power. Almost all genes are correlated to a certain extent with the proliferation signature, making complete elimination of its confounding effects impossible. RSS goes beyond breast cancer, because it also exists in other diseases; it is especially strong in other cancers in a platform-independent manner, and less severe, but present nonetheless, in nonproliferative diseases.
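The comparison of a published signature against random signatures can be sketched as a size-matched resampling check. This is a minimal illustration and not the authors' procedure: the scoring function (mean expression of the signature genes), the class-separation measure, and every function name below are assumptions made for the example.

```python
import random
import statistics


def signature_score(expression, genes):
    """Mean expression of the signature genes in one sample (illustrative score)."""
    return statistics.fmean(expression[g] for g in genes)


def separation(samples, labels, genes):
    """Absolute difference in mean signature score between class 1 and class 0."""
    pos = [signature_score(s, genes) for s, y in zip(samples, labels) if y == 1]
    neg = [signature_score(s, genes) for s, y in zip(samples, labels) if y == 0]
    return abs(statistics.fmean(pos) - statistics.fmean(neg))


def random_signature_pvalue(samples, labels, signature, n_random=1000, seed=0):
    """Fraction of size-matched random signatures that separate the classes
    at least as well as the given signature (with a +1 correction)."""
    rng = random.Random(seed)
    all_genes = sorted(samples[0])  # each sample is a dict: gene -> expression
    observed = separation(samples, labels, signature)
    hits = 0
    for _ in range(n_random):
        rand_sig = rng.sample(all_genes, len(signature))
        if separation(samples, labels, rand_sig) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)
```

A large value returned by `random_signature_pvalue` means that many random signatures do as well as the one under test, which is the situation the abstract calls RSS. Drawing random signatures of the same size matters here because the abstract reports that RSS correlates with signature size.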
|
19
|
Zhou L, Wong L, Goh WWB. Understanding missing proteins: a functional perspective. Drug Discov Today 2018; 23:644-651. [DOI: 10.1016/j.drudis.2017.11.011] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2017] [Revised: 10/24/2017] [Accepted: 11/13/2017] [Indexed: 01/03/2023]
|
20
|
Liang S, Ma A, Yang S, Wang Y, Ma Q. A Review of Matched-pairs Feature Selection Methods for Gene Expression Data Analysis. Comput Struct Biotechnol J 2018; 16:88-97. [PMID: 30275937 PMCID: PMC6158772 DOI: 10.1016/j.csbj.2018.02.005] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2017] [Revised: 02/14/2018] [Accepted: 02/19/2018] [Indexed: 12/31/2022] Open
Abstract
With the rapid accumulation of gene expression data from various technologies, e.g., microarray, RNA-sequencing (RNA-seq), and single-cell RNA-seq, it is necessary to carry out dimensionality reduction and feature (signature gene) selection to make sense of such high-dimensional data. These computational methods significantly facilitate further data analysis and interpretation, such as gene function enrichment analysis, cancer biomarker detection, and drug target identification in precision medicine. Although numerous methods have been developed for feature selection in bioinformatics, it is still a challenge to choose the appropriate method for a specific problem and to obtain the most reasonable feature ranking. Meanwhile, paired gene expression data under a matched case-control design (MCCD) are becoming increasingly popular; such data are often used in multi-omics integration studies and may increase feature selection efficiency by offsetting similar distributions of confounding features. However, feature selection methods specifically designed for paired data, named matched-pairs feature selection (MPFS), have not matured in parallel. In this review, we compare the performance of 10 feature-selection methods (eight MPFS methods and two traditional unpaired methods) on two real datasets by applying three classification methods, and analyze the algorithmic complexity of these methods by running their programs. This review aims to summarize and comprehensively present MPFS in such a way that readers can easily understand its characteristics and get a clue in selecting the appropriate methods for their analyses.
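Why a matched design can sharpen feature selection is easiest to see by scoring one gene with a paired and an unpaired statistic. The sketch below is only an illustration under stated assumptions: it uses the textbook paired t statistic and Welch's t statistic, which are not necessarily among the ten methods compared in the review, and the toy expression values are invented.

```python
import math
import statistics


def unpaired_t(case, control):
    """Welch's t statistic, treating the two groups as independent samples."""
    var_case = statistics.variance(case)
    var_ctrl = statistics.variance(control)
    se = math.sqrt(var_case / len(case) + var_ctrl / len(control))
    return (statistics.fmean(case) - statistics.fmean(control)) / se


def paired_t(case, control):
    """Paired t statistic on within-pair differences; case[i] matches control[i]."""
    diffs = [a - b for a, b in zip(case, control)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / se


# Invented toy data: large between-subject spread, small consistent shift per pair.
control = [2.0, 4.0, 6.0, 8.0, 10.0]
case = [2.5, 4.4, 6.6, 8.5, 10.5]

print("unpaired t:", round(unpaired_t(case, control), 3))
print("paired t:  ", round(paired_t(case, control), 3))
```

In this toy case the between-subject spread swamps the shift in the unpaired statistic, while the paired statistic cancels it by differencing within each pair, so the same gene ranks far higher under the paired score.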
Affiliation(s)
- Sen Liang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Anjun Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007, USA.,BioSNTR, Brookings, SD, USA
| | - Sen Yang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Qin Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007, USA.,BioSNTR, Brookings, SD, USA
|
21
|
Dealing with Confounders in Omics Analysis. Trends Biotechnol 2018; 36:488-498. [PMID: 29475622 DOI: 10.1016/j.tibtech.2018.01.013] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2017] [Revised: 01/28/2018] [Accepted: 01/29/2018] [Indexed: 01/05/2023]
Abstract
The Anna Karenina effect is a manifestation of the theory-practice gap that exists when theoretical statistics are applied on real-world data. In the course of analyzing biological data for differential features such as genes or proteins, it derives from the situation where the null hypothesis is rejected for extraneous reasons (or confounders), rather than because the alternative hypothesis is relevant to the disease phenotype. The mechanics of applying statistical tests therefore must address and resolve confounders. It is inadequate to simply rely on manipulating the P-value. We discuss three mechanistic elements (hypothesis statement construction, null distribution appropriateness, and test-statistic construction) and suggest how they can be designed to foil the Anna Karenina effect to select phenotypically relevant biological features.
|
22
|
Goh WWB, Sng JCG, Yee JY, See YM, Lee TS, Wong L, Lee J. Can Peripheral Blood-Derived Gene Expressions Characterize Individuals at Ultra-high Risk for Psychosis? COMPUTATIONAL PSYCHIATRY (CAMBRIDGE, MASS.) 2017; 1:168-183. [PMID: 30090857 PMCID: PMC6067827 DOI: 10.1162/cpsy_a_00007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/20/2017] [Accepted: 06/07/2017] [Indexed: 12/17/2022]
Abstract
The ultra-high risk (UHR) state was originally conceived to identify individuals at imminent risk of developing psychosis. Although recent studies have suggested that most individuals designated UHR do not, they constitute a distinctive group, exhibiting cognitive and functional impairments alongside multiple psychiatric morbidities. UHR characterization using molecular markers may improve understanding, provide novel insight into pathophysiology, and perhaps improve psychosis prediction reliability. Whole-blood gene expressions from 56 UHR subjects and 28 healthy controls are checked for existence of a consistent gene expression profile (signature) underlying UHR, across a variety of normalization and heterogeneity-removal techniques, including simple log-conversion, quantile normalization, gene fuzzy scoring (GFS), and surrogate variable analysis. During functional analysis, consistent and reproducible identification of important genes depends largely on how data are normalized. Normalization techniques that address sample heterogeneity are superior. The best performer, the unsupervised GFS, produced a strong and concise 12-gene signature, enriched for psychosis-associated genes. Importantly, when applied on random subsets of data, classifiers built with GFS are "meaningful" in the sense that the classifier models built using genes selected after other forms of normalization do not outperform random ones, but GFS-derived classifiers do. Data normalization can present highly disparate interpretations on biological data. Comparative analysis has shown that GFS is efficient at preserving signals while eliminating noise. Using this, we demonstrate confidently that the UHR designation is well correlated with a distinct blood-based gene signature.
Affiliation(s)
- Wilson Wen Bin Goh
- School of Biological Sciences, Nanyang Technological University, Singapore
- Department of Computer Science, National University of Singapore, Singapore
| | - Judy Chia-Ghee Sng
- Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Jie Yin Yee
- Research Division, Institute of Mental Health, Singapore
| | - Yuen Mei See
- Research Division, Institute of Mental Health, Singapore
| | - Tih-Shih Lee
- Neuroscience and Behavioral Disorders Program, Duke–National University of Singapore Medical School, Singapore
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore
- Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Jimmy Lee
- Research Division, Institute of Mental Health, Singapore
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
|
23
|
Teh DBL, Prasad A, Jiang W, Ariffin MZ, Khanna S, Belorkar A, Wong L, Liu X, All AH. Transcriptome Analysis Reveals Neuroprotective aspects of Human Reactive Astrocytes induced by Interleukin 1β. Sci Rep 2017; 7:13988. [PMID: 29070875 PMCID: PMC5656635 DOI: 10.1038/s41598-017-13174-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2017] [Accepted: 09/21/2017] [Indexed: 12/13/2022] Open
Abstract
Reactive astrogliosis is a critical process in neuropathological conditions and neurotrauma. Although it has been suggested that it confers neuroprotective effects, the exact genomic mechanism has not been explored. The prevailing dogma that astrogliosis inhibits axonal regeneration has been challenged by recent findings in rodent models of spinal cord injury, which demonstrate its neuroprotective and axonal regeneration properties. We examined whether these neuroprotective and axonal regeneration potentials can be identified in human spinal cord reactive astrocytes in vitro. Here, reactive astrogliosis was induced with IL1β. Within 24 hours of IL1β induction, astrocytes acquired reactive characteristics. Transcriptome analysis of over 40000 gene transcripts and subnetwork analysis with PFSnet revealed upregulation of chemokines and axonal permissive factors including FGF2, BDNF, and NGF. In addition, most genes regulating axonal inhibitory molecules, including ROBO1 and ROBO2, were downregulated. There was no increase in the gene expression of "Chondroitin Sulfate Proteoglycans" (CSPGs) clusters. This suggests that reactive astrocytes may not be the main contributor of CSPGs in the glial scar. PFSnet analysis also indicated an upregulation of the "Axonal Guidance Signaling" pathway. Our results suggest that human spinal cord reactive astrocytes are potentially neuroprotective at the early onset of reactive astrogliosis.
Affiliation(s)
- Daniel Boon Loong Teh
- Singapore Institute of Neurotechnology (SINAPSE), National University of Singapore, 28 Medical Drive, 5-COR, Singapore, 117456, Singapore
| | - Ankshita Prasad
- Department of Biomedical Engineering, National University of Singapore, E4, 4 Engineering Drive 3, Singapore, 117583, Singapore
| | - Wenxuan Jiang
- Department of Orthopaedic Surgery, National University of Singapore, 1E Kent Ridge Road, Singapore, 119228, Singapore
| | - Mohd Zacky Ariffin
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Sanjay Khanna
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Abha Belorkar
- Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
| | - Xiaogang Liu
- Department of Chemistry, National University of Singapore, 3 Science Drive 3, Singapore, 117543, Singapore.
| | - Angelo H All
- Singapore Institute of Neurotechnology (SINAPSE), National University of Singapore, 28 Medical Drive, 5-COR, Singapore, 117456, Singapore.,Department of Biomedical Engineering and Johns Hopkins School of Medicine, 701C Rutland Avenue 720, Baltimore, MD 21205, USA.,Department of Neurology, Johns Hopkins School of Medicine, 701C Rutland Avenue 720, Baltimore, MD 21205, USA.
|
24
|
Abstract
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms, but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package NetProt, which provides implementations of representative feature-selection methods. NetProt also provides methods for generating simulated differential data and generating pseudocomplexes for complex-based performance benchmarking. The NetProt open source R package is available for download from https://github.com/gohwils/NetProt/releases/, and online documentation is available at http://rpubs.com/gohwils/204259.
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China.,School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551.,Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417.,Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 119074
|
25
|
Costanzo M, Zacchia M, Bruno G, Crisci D, Caterino M, Ruoppolo M. Integration of Proteomics and Metabolomics in Exploring Genetic and Rare Metabolic Diseases. KIDNEY DISEASES 2017; 3:66-77. [PMID: 28868294 DOI: 10.1159/000477493] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/10/2017] [Revised: 05/15/2017] [Indexed: 12/12/2022]
Abstract
BACKGROUND Inherited metabolic disorders or inborn errors of metabolism are caused by deficiency of enzymatic activities in the catabolism of amino acids, carbohydrates, or lipids. These disorders include aminoacidopathies, urea cycle defects, organic acidemias, defects of oxidation of fatty acids, and lysosomal storage diseases. Inborn errors of metabolism constitute a significant proportion of genetic diseases, particularly in children; however, they are individually rare. Clinical phenotypes are very variable: some remain asymptomatic, others manifest metabolic decompensation in neonatal age, and others encompass mental retardation at later age. The clinical manifestation of these disorders can involve different organs and/or systems. Some disorders are easily managed if promptly diagnosed and treated, whereas in other cases neither diet, vitamin therapy, nor transplantation appears to prevent multi-organ impairment. SUMMARY Here, we discuss the principal challenges of metabolomics and proteomics in inherited metabolic disorders. We review the recent developments in mass spectrometry-based proteomic and metabolomic strategies. Mass spectrometry has become the most widely used platform in proteomics and metabolomics because of its ability to analyze a wide range of molecules, its optimal dynamic range, and great sensitivity. The fast measurement of a broad spectrum of metabolites in various body fluids, including small samples such as dried blood spots, has been facilitated by the use of mass spectrometry-based techniques. These approaches have enabled the timely diagnosis of inherited metabolic disorders, thereby facilitating early therapeutic intervention. Due to its analytical features, proteomics is suited for the basic investigation of inborn errors of metabolism. Modern approaches enable detailed functional characterization of the pathogenic biochemical processes, as achieved by quantification of proteins and identification of their regulatory chemical modifications. KEY MESSAGE Mass spectrometry-based "omics" approaches most frequently used to study the molecular mechanisms underlying the pathophysiology of inherited metabolic disorders are described.
Affiliation(s)
- Michele Costanzo
- Dipartimento di Medicina Molecolare e Biotecnologie Mediche, Università degli Studi di Napoli "Federico II," Naples, Italy
| | - Miriam Zacchia
- Prima Divisione di Nefrologia, Dipartimento di Scienze Cardio-Toraciche e Respiratorie, Università degli studi della Campania "Luigi Vanvitelli," Scuola di Medicina, Naples, Italy
| | | | - Daniela Crisci
- Dipartimento di Medicina Molecolare e Biotecnologie Mediche, Università degli Studi di Napoli "Federico II," Naples, Italy.,CEINGE - Biotecnologie Avanzate scarl, Naples, Italy
| | - Marianna Caterino
- Dipartimento di Medicina Molecolare e Biotecnologie Mediche, Università degli Studi di Napoli "Federico II," Naples, Italy.,CEINGE - Biotecnologie Avanzate scarl, Naples, Italy.,Associazione culturale DiSciMuS RCF, Naples, Italy
| | - Margherita Ruoppolo
- Dipartimento di Medicina Molecolare e Biotecnologie Mediche, Università degli Studi di Napoli "Federico II," Naples, Italy.,CEINGE - Biotecnologie Avanzate scarl, Naples, Italy.,Associazione culturale DiSciMuS RCF, Naples, Italy
|
26
|
Goh WWB, Wong L. Class-paired Fuzzy SubNETs: A paired variant of the rank-based network analysis family for feature selection based on protein complexes. Proteomics 2017; 17:e1700093. [PMID: 28390171 DOI: 10.1002/pmic.201700093] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2017] [Accepted: 04/05/2017] [Indexed: 01/12/2023]
Abstract
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue, and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are similar in the sense that they deploy rank-defined weights among proteins per sample. This procedure is known as gene fuzzy scoring. Currently, no RBNA exists for paired-sample scenarios where both control and test tissues originate from the same source (e.g. same patient). It is expected that paired tests, when used appropriately, are more powerful than approaches intended for unpaired samples. We report that the class-paired RBNA, PPFSNET, dominates in both simulated and real data scenarios. Moreover, for the first time, we explicitly incorporate batch-effect resistance as an additional evaluation criterion for feature-selection approaches. Batch effects are class-irrelevant variations arising from different handlers or processing times, and can obfuscate analysis. We demonstrate that PPFSNET and an earlier RBNA, PFSNET, are particularly resistant against batch effects, and only select features strongly correlated with class but not batch.
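The idea of rank-defined weights per sample can be sketched as follows. This is a hedged illustration of one plausible reading of gene fuzzy scoring, not the published implementation: the two quantile cutoffs `theta1` and `theta2`, the linear ramp between them, and the function name are all assumptions introduced for the example.

```python
def fuzzy_scores(sample, theta1=0.05, theta2=0.15):
    """Map each protein's within-sample abundance rank to a weight in [0, 1].

    sample: dict mapping protein -> abundance in a single sample.
    theta1, theta2: illustrative rank-quantile cutoffs (hypothetical values).
    Proteins in the top theta1 fraction get weight 1, proteins below the
    theta2 fraction get weight 0, and those in between get a linear ramp.
    """
    ranked = sorted(sample, key=sample.get, reverse=True)  # most abundant first
    n = len(ranked)
    scores = {}
    for position, protein in enumerate(ranked):
        q = (position + 1) / n  # rank quantile; small q means highly abundant
        if q <= theta1:
            scores[protein] = 1.0
        elif q <= theta2:
            scores[protein] = (theta2 - q) / (theta2 - theta1)
        else:
            scores[protein] = 0.0
    return scores
```

Because the weights depend only on ranks within a sample, a shift or rescaling that preserves the protein ordering in that sample leaves them unchanged, which is consistent with the batch-effect resistance the abstract reports for PFSNET and PPFSNET.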
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, P. R. China.,Department of Computer Science, National University of Singapore, Singapore
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore.,Department of Pathology, National University of Singapore, Singapore
|