1
|
Arif R, Kanwal S, Ahmed S, Kabir M. A Computational Predictor for Accurate Identification of Tumor Homing Peptides by Integrating Sequential and Deep BiLSTM Features. Interdiscip Sci 2024; 16:503-518. [PMID: 38733473 DOI: 10.1007/s12539-024-00628-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 03/16/2024] [Accepted: 03/27/2024] [Indexed: 05/13/2024]
Abstract
Cancer remains a severe illness, and current research indicates that tumor homing peptides (THPs) play an important part in cancer therapy. The identification of THPs can provide crucial insights for drug-discovery and pharmaceutical industries as they allow for tailored medication delivery towards cancer cells. These peptides have a high affinity enabling particular receptors present upon tumor surfaces, allowing for the creation of precision medications that reduce off-target consequences and enhance cancer patient treatment results. Wet-lab techniques are considered essential tools for studying THPs; however, they're labor-extensive and time-consuming, therefore making prediction of THPs a challenging task for the researchers. Computational-techniques, on the other hand, are considered significant tools in identifying THPs according to the sequence data. Despite many strategies have been presented to predict new THP, there is still a need to develop a robust method with higher rates of success. In this paper, we developed a novel framework, THP-DF, for accurately identifying THPs on a large-scale. Firstly, the peptide sequences are encoded through various sequential features. Secondly, each feature is passed to BiLSTM and attention layers to extract simplified deep features. Finally, an ensemble-framework is formed via integrating sequential- and deep features which are fed to a support vector machine which with 10-fold cross-validation to carry to validate the efficiency. The experimental results showed that THP-DF worked better on both [Formula: see text] and [Formula: see text] datasets by achieving accuracy of > 95% which are higher than existing predictors both datasets. This indicates that the proposed predictor could be a beneficial tool to precisely and rapidly identify THPs and will contribute to the cutting-edge cancer treatment strategies and pharmaceuticals.
Collapse
Affiliation(s)
- Roha Arif
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
| | - Sameera Kanwal
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
| | - Saeed Ahmed
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
| | - Muhammad Kabir
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan.
| |
Collapse
|
2
|
Wang J, Zhang L, Zeng A, Xia D, Yu J, Yu G. DeepIII: Predicting Isoform-Isoform Interactions by Deep Neural Networks and Data Fusion. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2177-2187. [PMID: 33764878 DOI: 10.1109/tcbb.2021.3068875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Alternative splicing enables a gene translating into different isoforms and into the corresponding proteoforms, which actually accomplish various biological functions of a living body. Isoform-isoform interactions (IIIs) provide a higher resolution interactome to explore the cellular processes and disease mechanisms than the canonically studied protein-protein interactions (PPIs), which are often recorded at the coarse gene level. The knowledge of IIIs is critical to map pathways, understand protein complexity and functional diversity, but the known IIIs are very scanty. In this paper, we propose a deep learning based method called DeepIII to systematically predict genome-wide IIIs by integrating diverse data sources, including RNA-seq datasets of different human tissues, exon array data, domain-domain interactions (DDIs) of proteins, nucleotide sequences and amino acid sequences. Particularly, DeepIII fuses these data to learn the representation of isoform pairs with a four-layer deep neural networks, and then performs binary classification on the learnt representation to achieve the prediction of IIIs. Experimental results show that DeepIII achieves a superior prediction performance to the state-of-the-art solutions and the III network constructed by DeepIII gives more accurate isoform function prediction. Case studies further confirm that DeepIII can differentiate the individual interaction partners of different isoforms spliced from the same gene. The code and datasets of DeepIII are available at http://mlda.swu.edu.cn/codes.php?name=DeepIII.
Collapse
|
3
|
Huang L, Liao L, Wu CH. Completing sparse and disconnected protein-protein network by deep learning. BMC Bioinformatics 2018; 19:103. [PMID: 29566671 PMCID: PMC5863833 DOI: 10.1186/s12859-018-2112-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2017] [Accepted: 03/12/2018] [Indexed: 12/01/2022] Open
Abstract
Background Protein-protein interaction (PPI) prediction remains a central task in systems biology to achieve a better and holistic understanding of cellular and intracellular processes. Recently, an increasing number of computational methods have shifted from pair-wise prediction to network level prediction. Many of the existing network level methods predict PPIs under the assumption that the training network should be connected. However, this assumption greatly affects the prediction power and limits the application area because the current golden standard PPI networks are usually very sparse and disconnected. Therefore, how to effectively predict PPIs based on a training network that is sparse and disconnected remains a challenge. Results In this work, we developed a novel PPI prediction method based on deep learning neural network and regularized Laplacian kernel. We use a neural network with an autoencoder-like architecture to implicitly simulate the evolutionary processes of a PPI network. Neurons of the output layer correspond to proteins and are labeled with values (1 for interaction and 0 for otherwise) from the adjacency matrix of a sparse disconnected training PPI network. Unlike autoencoder, neurons at the input layer are given all zero input, reflecting an assumption of no a priori knowledge about PPIs, and hidden layers of smaller sizes mimic ancient interactome at different times during evolution. After the training step, an evolved PPI network whose rows are outputs of the neural network can be obtained. We then predict PPIs by applying the regularized Laplacian kernel to the transition matrix that is built upon the evolved PPI network. The results from cross-validation experiments show that the PPI prediction accuracies for yeast data and human data measured as AUC are increased by up to 8.4 and 14.9% respectively, as compared to the baseline. Moreover, the evolved PPI network can also help us leverage complementary information from the disconnected training network and multiple heterogeneous data sources. Tested by the yeast data with six heterogeneous feature kernels, the results show our method can further improve the prediction performance by up to 2%, which is very close to an upper bound that is obtained by an Approximate Bayesian Computation based sampling method. Conclusions The proposed evolution deep neural network, coupled with regularized Laplacian kernel, is an effective tool in completing sparse and disconnected PPI networks and in facilitating integration of heterogeneous data sources.
Collapse
Affiliation(s)
- Lei Huang
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716, Delaware, USA
| | - Li Liao
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716, Delaware, USA.
| | - Cathy H Wu
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716, Delaware, USA.,Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711, Delaware, USA
| |
Collapse
|
4
|
Du T, Liao L, Wu CH. Enhancing interacting residue prediction with integrated contact matrix prediction in protein-protein interaction. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2016; 2016:17. [PMID: 27818677 PMCID: PMC5075339 DOI: 10.1186/s13637-016-0051-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/02/2016] [Accepted: 09/25/2016] [Indexed: 11/10/2022]
Abstract
Identifying the residues in a protein that are involved in protein-protein interaction and identifying the contact matrix for a pair of interacting proteins are two computational tasks at different levels of an in-depth analysis of protein-protein interaction. Various methods for solving these two problems have been reported in the literature. However, the interacting residue prediction and contact matrix prediction were handled by and large independently in those existing methods, though intuitively good prediction of interacting residues will help with predicting the contact matrix. In this work, we developed a novel protein interacting residue prediction system, contact matrix-interaction profile hidden Markov model (CM-ipHMM), with the integration of contact matrix prediction and the ipHMM interaction residue prediction. We propose to leverage what is learned from the contact matrix prediction and utilize the predicted contact matrix as "feedback" to enhance the interaction residue prediction. The CM-ipHMM model showed significant improvement over the previous method that uses the ipHMM for predicting interaction residues only. It indicates that the downstream contact matrix prediction could help the interaction site prediction.
Collapse
Affiliation(s)
- Tianchuan Du
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
| | - Li Liao
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
| | - Cathy H Wu
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
| |
Collapse
|
5
|
Mier P, Alanis-Lobato G, Andrade-Navarro MA. Protein-protein interactions can be predicted using coiled coil co-evolution patterns. J Theor Biol 2016; 412:198-203. [PMID: 27832945 DOI: 10.1016/j.jtbi.2016.11.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2016] [Revised: 10/21/2016] [Accepted: 11/04/2016] [Indexed: 12/29/2022]
Abstract
Protein-protein interactions are sometimes mediated by coiled coil structures. The evolutionary conservation of interacting orthologs in different species, along with the presence or absence of coiled coils in them, may help in the prediction of interacting pairs. Here, we illustrate how the presence of coiled coils in a protein can be exploited as a potential indicator for its interaction with another protein with coiled coils. The prediction capability of our strategy improves when restricting our dataset to highly reliable, known protein-protein interactions. Our study of the co-evolution of coiled coils demonstrates that pairs of interacting proteins can be distinguished from not interacting pairs by means of their structural information. This hints at the potential of our strategy to predict new protein-protein interactions.
Collapse
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Johannes Gutenberg University Mainz, Gresemundweg 2, 55128 Mainz, Germany; Institute of Molecular Biology, Ackermannweg 4, 55128 Mainz, Germany
| | - Gregorio Alanis-Lobato
- Faculty of Biology, Johannes Gutenberg University Mainz, Gresemundweg 2, 55128 Mainz, Germany; Institute of Molecular Biology, Ackermannweg 4, 55128 Mainz, Germany
| | - Miguel A Andrade-Navarro
- Faculty of Biology, Johannes Gutenberg University Mainz, Gresemundweg 2, 55128 Mainz, Germany; Institute of Molecular Biology, Ackermannweg 4, 55128 Mainz, Germany
| |
Collapse
|
6
|
Sze-To A, Fung S, Lee ESA, Wong AK. Prediction of Protein–Protein Interaction via co-occurring Aligned Pattern Clusters. Methods 2016; 110:26-34. [DOI: 10.1016/j.ymeth.2016.07.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2016] [Revised: 06/25/2016] [Accepted: 07/26/2016] [Indexed: 10/21/2022] Open
|
7
|
Huang L, Liao L, Wu CH. Protein-protein interaction prediction based on multiple kernels and partial network with linear programming. BMC SYSTEMS BIOLOGY 2016. [PMCID: PMC4977483 DOI: 10.1186/s12918-016-0296-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 08/30/2023]
Abstract
Background Prediction of de novo protein-protein interaction is a critical step toward reconstructing PPI networks, which is a central task in systems biology. Recent computational approaches have shifted from making PPI prediction based on individual pairs and single data source to leveraging complementary information from multiple heterogeneous data sources and partial network structure. However, how to quickly learn weights for heterogeneous data sources remains a challenge. In this work, we developed a method to infer de novo PPIs by combining multiple data sources represented in kernel format and obtaining optimal weights based on random walk over the existing partial networks. Results Our proposed method utilizes Barker algorithm and the training data to construct a transition matrix which constrains how a random walk would traverse the partial network. Multiple heterogeneous features for the proteins in the network are then combined into the form of weighted kernel fusion, which provides a new "adjacency matrix" for the whole network that may consist of disconnected components but is required to comply with the transition matrix on the training subnetwork. This requirement is met by adjusting the weights to minimize the element-wise difference between the transition matrix and the weighted kernels. The minimization problem is solved by linear programming. The weighted kernel fusion is then transformed to regularized Laplacian (RL) kernel to infer missing or new edges in the PPI network, which can potentially connect the previously disconnected components. Conclusions The results on synthetic data demonstrated the soundness and robustness of the proposed algorithms under various conditions. And the results on real data show that the accuracies of PPI prediction for yeast data and human data measured as AUC are increased by up to 19 % and 11 % respectively, as compared to a control method without using optimal weights. Moreover, the weights learned by our method Weight Optimization by Linear Programming (WOLP) are very consistent with that learned by sampling, and can provide insights into the relations between PPIs and various feature kernel, thereby improving PPI prediction even for disconnected PPI networks.
Collapse
|
8
|
Huang L, Liao L, Wu CH. Inference of protein-protein interaction networks from multiple heterogeneous data. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2016; 2016:8. [PMID: 26941784 PMCID: PMC4761017 DOI: 10.1186/s13637-016-0040-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/06/2015] [Accepted: 02/09/2016] [Indexed: 11/29/2022]
Abstract
Protein-protein interaction (PPI) prediction is a central task in achieving a better understanding of cellular and intracellular processes. Because high-throughput experimental methods are both expensive and time-consuming, and are also known of suffering from the problems of incompleteness and noise, many computational methods have been developed, with varied degrees of success. However, the inference of PPI network from multiple heterogeneous data sources remains a great challenge. In this work, we developed a novel method based on approximate Bayesian computation and modified differential evolution sampling (ABC-DEP) and regularized laplacian (RL) kernel. The method enables inference of PPI networks from topological properties and multiple heterogeneous features including gene expression and Pfam domain profiles, in forms of weighted kernels. The optimal weights are obtained by ABC-DEP, and the kernel fusion built based on optimal weights serves as input to RL to infer missing or new edges in the PPI network. Detailed comparisons with control methods have been made, and the results show that the accuracy of PPI prediction measured by AUC is increased by up to 23 %, as compared to a baseline without using optimal weights. The method can provide insights into the relations between PPIs and various feature kernels and demonstrates strong capability of predicting faraway interactions that cannot be well detected by traditional RL method.
Collapse
Affiliation(s)
- Lei Huang
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716 DE USA
| | - Li Liao
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716 DE USA
| | - Cathy H Wu
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716 DE USA ; Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711 DE USA
| |
Collapse
|
9
|
Du X, Jing A, Hu X. A novel feature extraction scheme for prediction of protein-protein interaction sites. MOLECULAR BIOSYSTEMS 2014; 11:475-85. [PMID: 25413666 DOI: 10.1039/c4mb00625a] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Identifying protein-protein interaction (PPI) sites plays an important and challenging role in some topics of biology. Although many methods have been proposed, this problem is still far away to be solved. Here, a feature selection approach with an 11-sliding window and random forest algorithm is proposed, which is called DX-RF. This method has achieved an accuracy of 88.79%, recall of 82.09%, and precision of 85.76% with top-ranked 34 features on the Hetero test dataset and has 91.6% accuracy, 89.2% precision, 83.54% recall with top-ranked 25 features set on the Homo test dataset. Compared to other methods, the results indicate that the DX-RF method has a strong ability to select relevance features to get a higher performance. Moreover, in order to further understand protein interactions, feature analysis in this study is also performed.
Collapse
Affiliation(s)
- Xiuquan Du
- Key Laboratory of Intelligent Computing & Signal Processing, Ministry of Education, Anhui University, Anhui, China.
| | | | | |
Collapse
|
10
|
Zhu Y, Zhou W, Dai DQ, Yan H. Identification of DNA-binding and protein-binding proteins using enhanced graph wavelet features. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1017-1031. [PMID: 24334394 DOI: 10.1109/tcbb.2013.117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Interactions between biomolecules play an essential role in various biological processes. For predicting DNA-binding or protein-binding proteins, many machine-learning-based techniques have used various types of features to represent the interface of the complexes, but they only deal with the properties of a single atom in the interface and do not take into account the information of neighborhood atoms directly. This paper proposes a new feature representation method for biomolecular interfaces based on the theory of graph wavelet. The enhanced graph wavelet features (EGWF) provides an effective way to characterize interface feature through adding physicochemical features and exploiting a graph wavelet formulation. Particularly, graph wavelet condenses the information around the center atom, and thus enhances the discrimination of features of biomolecule binding proteins in the feature space. Experiment results show that EGWF performs effectively for predicting DNA-binding and protein-binding proteins in terms of Matthew's correlation coefficient (MCC) score and the area value under the receiver operating characteristic curve (AUC).
Collapse
Affiliation(s)
- Yuan Zhu
- Guangdong University of Finance and Economics, Guangzhou and Sun Yat-Sen University, Guangzhou
| | | | | | - Hong Yan
- City University of Hong Kong, Hong Kong and University of Sydney, Sydney
| |
Collapse
|
11
|
González AJ, Liao L, Wu CH. Prediction of contact matrix for protein-protein interaction. ACTA ACUST UNITED AC 2013; 29:1018-25. [PMID: 23418186 DOI: 10.1093/bioinformatics/btt076] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
MOTIVATION Prediction of protein-protein interaction has become an important part of systems biology in reverse engineering the biological networks for better understanding the molecular biology of the cell. Although significant progress has been made in terms of prediction accuracy, most computational methods only predict whether two proteins interact but not their interacting residues-the information that can be very valuable for understanding the interaction mechanisms and designing modulation of the interaction. In this work, we developed a computational method to predict the interacting residue pairs-contact matrix for interacting protein domains, whose rows and columns correspond to the residues in the two interacting domains respectively and whose values (1 or 0) indicate whether the corresponding residues (do or do not) interact. RESULTS Our method is based on supervised learning using support vector machines. For each domain involved in a given domain-domain interaction (DDI), an interaction profile hidden Markov model (ipHMM) is first built for the domain family, and then each residue position for a member domain sequence is represented as a 20-dimension vector of Fisher scores, characterizing how similar it is as compared with the family profile at that position. Each element of the contact matrix for a sequence pair is now represented by a feature vector from concatenating the vectors of the two corresponding residues, and the task is to predict the element value (1 or 0) from the feature vector. A support vector machine is trained for a given DDI, using either a consensus contact matrix or contact matrices for individual sequence pairs, and is tested by leave-one-out cross validation. The performance averaged over a set of 115 DDIs collected from the 3 DID database shows significant improvement (sensitivity up to 85%, and specificity up to 85%), as compared with a multiple sequence alignment-based method (sensitivity 57%, and specificity 78%) previously reported in the literature. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Alvaro J González
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
| | | | | |
Collapse
|
12
|
Urquiza JM, Rojas I, Pomares H, Herrera J, Florido JP, Valenzuela O, Cepero M. Using machine learning techniques and genomic/proteomic information from known databases for defining relevant features for PPI classification. Comput Biol Med 2012; 42:639-50. [PMID: 22575173 DOI: 10.1016/j.compbiomed.2012.01.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2011] [Revised: 01/10/2012] [Accepted: 01/27/2012] [Indexed: 11/17/2022]
Abstract
In modern proteomics, prediction of protein-protein interactions (PPIs) is a key research line, as these interactions take part in most essential biological processes. In this paper, a new approach is proposed to PPI data classification based on the extraction of genomic and proteomic information from well-known databases and the incorporation of semantic measures. This approach is carried out through the application of data mining techniques and provides very accurate models with high levels of sensitivity and specificity in the classification of PPIs. The well-known support vector machine paradigm is used to learn the models, which will also return a new confidence score which may help expert researchers to filter out and validate new external PPIs. One of the most-widely analyzed organisms, yeast, will be studied. We processed a very high-confidence dataset by extracting up to 26 specific features obtained from the chosen databases, half of them calculated using two new similarity measures proposed in this paper. Then, by applying a filter-wrapper algorithm for feature selection, we obtained a final set composed of the eight most relevant features for predicting PPIs, which was validated by a ROC analysis. The prediction capability of the support vector machine model using these eight features was tested through the evaluation of the predictions obtained in a set of external experimental, computational, and literature-collected datasets.
Collapse
Affiliation(s)
- J M Urquiza
- Department of Computer Architecture and Computer Technology, University of Granada, Granada, Spain.
| | | | | | | | | | | | | |
Collapse
|
13
|
Wang J, Gao X, Wang Q, Li Y. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics 2012; 13 Suppl 7:S2. [PMID: 22594999 PMCID: PMC3348016 DOI: 10.1186/1471-2105-13-s7-s2] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. This kind of pairwise measures suffer from the limitation of neglecting the distribution of other proteins and thus cannot satisfy the need for high accuracy of the retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning the contextual dissimilarity/similarity measures can improve the retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner, which does not utilize the information of the known class labels of proteins in the database. RESULTS In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure dij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is, for a pair of proteins (i, j), if their context N(i) and N(j) is similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing dij by a factor learned from the context N(i) and N(j).Moreover, we divide the context to hierarchial sub-context and get the contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select the relevant (a pair of proteins that has the same class labels) and irrelevant (with different labels) protein pairs, and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchial Context Coherently in an iterative algorithm--ProDis-ContSHC.We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other unsupervised contextual dissimilarity measures that do not use the class label information. CONCLUSIONS Using the contextual proteins with their class labels in the database, we can improve the accuracy of the pairwise dissimilarity/similarity measures dramatically for the protein retrieval tasks. In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.
Collapse
Affiliation(s)
- Jingyan Wang
- King Abdullah University of Science and Technology (KAUST), Mathematical and Computer Sciences and Engineering Division, Thuwal, 23955-6900, Saudi Arabia
| | | | | | | |
Collapse
|