1. Xu S, Ackerman ME. Leveraging permutation testing to assess confidence in positive-unlabeled learning applied to high-dimensional biological datasets. BMC Bioinformatics 2024; 25:218. [PMID: 38898392 PMCID: PMC11186207 DOI: 10.1186/s12859-024-05834-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 06/10/2024] [Indexed: 06/21/2024] [Open access]
Abstract
BACKGROUND: Compared to traditional supervised machine learning approaches employing fully labeled samples, positive-unlabeled (PU) learning techniques aim to classify "unlabeled" samples based on a smaller proportion of known positive examples. This more challenging modeling goal reflects many real-world scenarios in which negative examples are not available, posing direct challenges to defining prediction accuracy and robustness. While several studies have evaluated predictions learned from only definitive positive examples, few have investigated whether correct classification of a high proportion of known positive (KP) samples from among unlabeled samples can act as a surrogate to indicate model quality. RESULTS: In this study, we report a novel methodology combining multiple established PU learning-based strategies with permutation testing to evaluate the potential of KP samples to accurately classify unlabeled samples without using "ground truth" positive and negative labels for validation. Multivariate synthetic and real-world high-dimensional benchmark datasets were employed to demonstrate the suitability of the proposed pipeline to provide evidence of model robustness across varied underlying ground truth class label compositions among the unlabeled set and with different proportions of KP examples. Comparisons between model performance with actual and permuted labels could be used to distinguish reliable from unreliable models. CONCLUSIONS: As in fully supervised machine learning, permutation testing offers a means to set a baseline "no-information rate" benchmark in the context of semi-supervised PU learning inference tasks, providing a standard against which model performance can be compared.
Affiliation(s)
- Shiwei Xu
  - Quantitative Biomedical Sciences Program, Dartmouth College, Hanover, NH, USA
- Margaret E Ackerman
  - Quantitative Biomedical Sciences Program, Dartmouth College, Hanover, NH, USA
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Dartmouth College, Hanover, NH, USA
  - Thayer School of Engineering, Dartmouth College, 14 Engineering Dr., Hanover, NH, 03755, USA
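
The permutation baseline described in the abstract of entry 1 can be illustrated with a small sketch. This is not the authors' pipeline: the nearest-centroid scorer, the function names (`kp_recall`, `permutation_test`), the holdout and top-fraction parameters, and the toy Gaussian data are all assumptions made for illustration. The idea shown is only the general one: measure how often held-out known positives (KP) are recovered from the unlabeled pool, then repeat with the KP flags shuffled to obtain a "no-information" reference distribution.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kp_recall(features, is_kp, rng, holdout_frac=0.5, top_frac=0.4):
    """Fraction of held-out KP samples ranked in the top `top_frac` of the pool.

    Hypothetical scorer: distance to the centroid of the training KP samples.
    Requires at least two KP samples so that the training split is non-empty.
    """
    kp = [i for i, flag in enumerate(is_kp) if flag]
    rng.shuffle(kp)
    n_hold = max(1, int(len(kp) * holdout_frac))
    held, train = kp[:n_hold], kp[n_hold:]
    dim = len(features[0])
    centroid = [sum(features[i][d] for i in train) / len(train) for d in range(dim)]
    train_set = set(train)
    # Pool = held-out KP samples plus everything unlabeled.
    pool = [i for i in range(len(features)) if i not in train_set]
    ranked = sorted(pool, key=lambda i: sq_dist(features[i], centroid))
    top = set(ranked[: int(len(pool) * top_frac)])
    return sum(i in top for i in held) / len(held)

def permutation_test(features, is_kp, n_perm=200, seed=0):
    """Compare KP recall under the actual labels with recall under shuffled labels."""
    rng = random.Random(seed)
    observed = kp_recall(features, is_kp, rng)
    null_scores = []
    for _ in range(n_perm):
        shuffled = list(is_kp)
        rng.shuffle(shuffled)  # breaks any link between features and KP status
        null_scores.append(kp_recall(features, shuffled, rng))
    # Add-one correction keeps the empirical p-value strictly positive.
    p_value = (1 + sum(s >= observed for s in null_scores)) / (1 + n_perm)
    return observed, null_scores, p_value

# Toy data (assumed): 40 true positives, 60 true negatives, 20 positives labeled as KP.
data_rng = random.Random(0)
positives = [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(40)]
negatives = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(60)]
features = positives + negatives
is_kp = [i < 20 for i in range(len(features))]

observed, null_scores, p_value = permutation_test(features, is_kp, n_perm=200, seed=1)
print(f"observed KP recall: {observed:.2f}")
print(f"mean permuted recall: {sum(null_scores) / len(null_scores):.2f}")
print(f"empirical p-value: {p_value:.3f}")
```

On well-separated toy data such as this, the observed recall should sit well above the permuted scores; on data where the KP samples carry no signal, the two would be expected to overlap.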

2. Ansari M, White AD. Learning peptide properties with positive examples only. Digital Discovery 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and a low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled (PU) learning. In particular, we use the two learning strategies of adapting a base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
- Andrew D White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
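
The "reliable negative identification" strategy named in the abstract of entry 2 is commonly described as a two-step procedure, and a minimal sketch of that general idea follows. It is not the authors' method: they build deep learning models on peptide sequences, whereas this sketch swaps in a nearest-centroid rule on toy 2D points, and the function name `two_step_pu` and the `neg_frac` parameter are hypothetical.

```python
import random

def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_step_pu(positives, unlabeled, neg_frac=0.3):
    """Return one boolean per unlabeled sample (True = predicted positive).

    Step 1: treat the unlabeled samples farthest from the positive centroid
            as reliable negatives.
    Step 2: classify every unlabeled sample by whichever centroid is nearer
            (positive centroid vs reliable-negative centroid).
    """
    pos_c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda p: sq_dist(p, pos_c), reverse=True)
    n_neg = max(1, int(len(ranked) * neg_frac))
    neg_c = centroid(ranked[:n_neg])
    return [sq_dist(p, pos_c) < sq_dist(p, neg_c) for p in unlabeled]

# Toy data (assumed): 15 labeled positives; the unlabeled set hides 15 positives
# (listed first) among 45 negatives.
rng = random.Random(1)
labeled_pos = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(15)]
hidden_pos = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(15)]
hidden_neg = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(45)]
unlabeled = hidden_pos + hidden_neg

predictions = two_step_pu(labeled_pos, unlabeled)
print("hidden positives recovered:", sum(predictions[:15]), "of 15")
print("negatives flagged as positive:", sum(predictions[15:]), "of 45")
```

The choice of `neg_frac` is a trade-off: a small value gives a cleaner but less representative negative set, while a large value risks pulling hidden positives into the negatives.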

3. Ansari M, White AD. Learning Peptide Properties with Positive Examples Only. bioRxiv 2023:2023.06.01.543289. [PMID: 37333233 PMCID: PMC10274696 DOI: 10.1101/2023.06.01.543289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and a low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled (PU) learning. In particular, we use the two learning strategies of adapting a base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
- Andrew D. White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA

4. Yehuda GA, Somekh J. A methodology for classifying tissue-specific metabolic and inflammatory receptor functions applied to subcutaneous and visceral adipose. PLoS One 2022; 17:e0276699. [PMID: 36282842 PMCID: PMC9595531 DOI: 10.1371/journal.pone.0276699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2022] [Accepted: 10/12/2022] [Indexed: 11/07/2022] [Open access]
Abstract
To achieve homeostasis, the human biological system relies on the interaction between organs through the binding of ligands secreted from source organs to receptors located on destination organs. Currently, the changing roles that receptors perform in tissues are only partially understood. Recently, a methodology based on receptor co-expression patterns to classify their tissue-specific metabolic functions was suggested. Here we present an advanced framework to predict an additional class of inflammatory receptors that uses a feature space of biological pathway enrichment analysis scores of co-expression networks and their eigengene correlations. These are fed into three machine learning classifiers: eXtreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), and K-Nearest Neighbors (k-NN). We applied our methodology to subcutaneous and visceral adipose gene expression datasets derived from the GTEx (Genotype-Tissue Expression) project and compared the predictions. The XGBoost model demonstrated the best performance in predicting the pre-labeled receptors, with an accuracy of 0.89/0.8 in subcutaneous/visceral adipose. We analyzed ~700 receptors to predict eight new metabolic and 15 new inflammatory functions of receptors and four new metabolic functions for known inflammatory receptors in both adipose tissues. We cross-referenced multiple predictions using the published literature. Our results establish a picture of the changing functions of receptors for two adipose tissues that can be beneficial for drug development.
Affiliation(s)
- Judith Somekh
  - Information Systems, University of Haifa, Haifa, Israel

5. DTIP-TC2A: An analytical framework for drug-target interactions prediction methods. Comput Biol Chem 2022; 99:107707. [DOI: 10.1016/j.compbiolchem.2022.107707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2021] [Revised: 05/01/2022] [Accepted: 05/26/2022] [Indexed: 11/18/2022]

6. Somekh J. A methodology for predicting tissue-specific metabolic roles of receptors applied to subcutaneous adipose. Sci Rep 2020; 10:19535. [PMID: 33177567 PMCID: PMC7659321 DOI: 10.1038/s41598-020-73214-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/24/2020] [Accepted: 09/11/2020] [Indexed: 01/18/2023] [Open access]
Abstract
The human biological system uses 'inter-organ' communication to achieve a state of homeostasis. This communication occurs through the response of receptors, located on target organs, to the binding of secreted ligands from source organs. Despite years of research, the roles these receptors play in tissues are only partially understood. This work presents a new methodology based on the enrichment analysis scores of co-expression networks fed into support vector machine (SVM) and k-NN classifiers to predict the tissue-specific metabolic roles of receptors. The approach is primarily based on the detection of coordination patterns of receptor expression. These patterns and the enrichment analysis scores of their co-expression networks were used to analyse ~700 receptors and predict metabolic roles of receptors in subcutaneous adipose. To facilitate supervised learning, a list of known metabolic and non-metabolic receptors was constructed using a semi-supervised approach following literature-based verification. Our approach confirms that pathway enrichment scores are good signatures for correctly classifying the metabolic receptors in adipose. We also show that the k-NN method outperforms the SVM method in classifying metabolic receptors. Finally, we predict novel metabolic roles of receptors. These predictions can enhance biological understanding and the development of new receptor-targeting metabolic drugs.
Affiliation(s)
- Judith Somekh
  - Department of Information Systems, University of Haifa, Haifa, Israel

7. Ke T, Li M, Zhang L, Lv H, Ge X. Construct a biased SVM classifier based on Chebyshev distance for PU learning. Journal of Intelligent & Fuzzy Systems 2020. [DOI: 10.3233/jifs-192064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
In some real applications, only a limited number of labeled positive examples and many unlabeled examples are available, with no negative examples. Such learning is termed positive and unlabeled (PU) learning. PU learning algorithms have been studied extensively in recent years. However, the classical ones based on Support Vector Machines (SVMs) assume that the labeled positive data are independent and identically distributed (i.i.d.) and that the sample size is large enough. This leads to two obvious shortcomings. On the one hand, the performance is not satisfactory, especially when the number of labeled positive examples is small. On the other hand, classification results are not satisfactory when datasets are non-i.i.d. For this reason, this paper proposes a novel SVM classifier that uses the Chebyshev distance to measure the empirical risk and designs an efficient iterative algorithm, named L∞-BSVM for short. L∞-BSVM has the following merits: (1) it allows all sample points to participate in learning to improve classification performance, especially when the size of the labeled data is small; (2) it minimizes the distance of the sample points farthest from the hyperplane (the outliers in non-i.i.d. data), so that outliers are sufficiently taken into consideration; (3) the iterative algorithm can solve large-scale optimization problems with low time complexity and ensures convergence to the optimal solution. Finally, extensive experiments on three types of datasets (artificial non-i.i.d. datasets, fault diagnosis of railway turnouts with few labeled data (abnormal turnouts), and six real-world benchmark datasets) support these claims and demonstrate that the classifier performs much better than state-of-the-art competitors such as B-SVM, LUHC, Pulce, B-LSSVM, and NB.
Affiliation(s)
- Ting Ke
  - Department of Mathematics, College of Science, Tianjin University of Science & Technology, Tianjin, China
- Min Li
  - Department of Mathematics, College of Science, Tianjin University of Science & Technology, Tianjin, China
- Lidong Zhang
  - Department of Mathematics, College of Science, Tianjin University of Science & Technology, Tianjin, China
- Hui Lv
  - Department of Mathematics, College of Science, Tianjin University of Science & Technology, Tianjin, China
- Xuechun Ge
  - China Academy of Railway Sciences Signal and Communication Research Institute (Beijing Huatie Information Technology Corporation), Beijing, China

8. Foong R, Ang KK, Zhang Z, Quek C. An iterative cross-subject negative-unlabeled learning algorithm for quantifying passive fatigue. J Neural Eng 2019; 16:056013. [PMID: 31141797 DOI: 10.1088/1741-2552/ab255d] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 02/07/2023]
Abstract
OBJECTIVE: This paper proposes an iterative negative-unlabeled (NU) learning algorithm for cross-subject detection of passive fatigue from labeled alert (negative) and unlabeled driving EEG data. APPROACH: Unlike other studies which used manual labeling of the fatigue state, the proposed algorithm (PA) first iteratively uses 29 subjects' alert data and unlabeled driving data to identify the most fatigued block of EEG data in each subject in a cross-subject manner. Subsequently, the PA computes subjects' driving fatigue score. Repeated measures correlations of the score to EEG band powers are then performed. MAIN RESULTS: The PA yields an averaged accuracy of 93.77% ± 8.15% across subjects in detecting fatigue, which is significantly better than the various baselines. The fatigue scores obtained are also significantly positively correlated with theta band power and negatively correlated with beta band power, which are known to respectively increase and decrease in the presence of passive fatigue. There is a strong negative correlation with alpha band power as well. SIGNIFICANCE: The proposed iterative NU learning algorithm is capable of labeling and quantifying the most fatigued block in a cross-subject manner despite the lack of ground truth in the fatigue levels of unlabeled driving EEG data. Together with the significant correlations with theta, alpha and beta band power, the results show promise in the application of the proposed algorithm to detect fatigue from EEG.
Affiliation(s)
- Ruyi Foong
  - Neural and Biomedical Technology, Institute for Infocomm Research, Singapore
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore

9. Frey NC, Wang J, Vega Bellido GI, Anasori B, Gogotsi Y, Shenoy VB. Prediction of Synthesis of 2D Metal Carbides and Nitrides (MXenes) and Their Precursors with Positive and Unlabeled Machine Learning. ACS Nano 2019; 13:3031-3041. [PMID: 30830760 DOI: 10.1021/acsnano.8b08014] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Indexed: 05/06/2023]
Abstract
Growing interest in the potential applications of two-dimensional (2D) materials has fueled advancement in the identification of 2D systems with exotic properties. Increasingly, the bottleneck in this field is the synthesis of these materials. Although theoretical calculations have predicted a myriad of promising 2D materials, only a few dozen have been experimentally realized since the initial discovery of graphene. Here, we adapt the state-of-the-art positive and unlabeled (PU) machine learning framework to predict which theoretically proposed 2D materials have the highest likelihood of being successfully synthesized. Using elemental information and data from high-throughput density functional theory calculations, we apply the PU learning method to the MXene family of 2D transition metal carbides, carbonitrides, and nitrides, and their layered precursor MAX phases, and identify 18 MXene compounds that are highly promising candidates for synthesis. By considering both the MXenes and their precursors, we further propose 20 synthesizable MAX phases that can be chemically exfoliated to produce MXenes.
Affiliation(s)
- Nathan C Frey
  - Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
- Jin Wang
  - Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
- Gabriel Iván Vega Bellido
  - Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
  - Department of Chemical Engineering, University of Puerto Rico at Mayagüez, Mayagüez 00681, Puerto Rico
- Babak Anasori
  - Department of Materials Science and Engineering and A.J. Drexel Nanomaterials Institute, Drexel University, Philadelphia, Pennsylvania 19104, United States
- Yury Gogotsi
  - Department of Materials Science and Engineering and A.J. Drexel Nanomaterials Institute, Drexel University, Philadelphia, Pennsylvania 19104, United States
- Vivek B Shenoy
  - Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States

10. Pogodin PV, Lagunin AA, Rudik AV, Filimonov DA, Druzhilovskiy DS, Nicklaus MC, Poroikov VV. How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors. Front Chem 2018; 6:133. [PMID: 29755970 PMCID: PMC5935003 DOI: 10.3389/fchem.2018.00133] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/29/2018] [Accepted: 04/09/2018] [Indexed: 12/16/2022] [Open access]
Abstract
Discovery of new pharmaceutical substances is currently boosted by the possibility of using the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the "active" and "inactive" compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
Affiliation(s)
- Pavel V. Pogodin
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Alexey A. Lagunin
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
  - Department of Bioinformatics, Medical-Biological Department, Pirogov Russian National Research Medical University, Moscow, Russia
- Anastasia V. Rudik
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Dmitry A. Filimonov
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Mark C. Nicklaus
  - Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, NCI-Frederick, Frederick, MD, United States

11. Ke T, Jing L, Lv H, Zhang L, Hu Y. Global and local learning from positive and unlabeled examples. Appl Intell 2017. [DOI: 10.1007/s10489-017-1076-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]

12. Nagi S, Bhattacharyya DK. Classification of microarray cancer data using ensemble approach. Netw Model Anal Health Inform Bioinforma 2013. [DOI: 10.1007/s13721-013-0034-x] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 11/28/2022]