1
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024]
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its functions, the functioning of the cell, and its possible interactions with other proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations accordingly. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories, followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by prediction tool taxonomies that highlight their relevant areas, with examples for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools and can be considered a quick guide for researchers to identify and explore recent advances in the literature.
Affiliation(s)
- Maryam Gillani
- School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
- School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
2
Ahmad EM, Abdelsamad A, El-Shabrawi HM, El-Awady MAM, Aly MAM, El-Soda M. In-silico identification of putatively functional intergenic small open reading frames in the cucumber genome and their predicted response to biotic and abiotic stresses. Plant, Cell & Environment 2024. [PMID: 39189930 DOI: 10.1111/pce.15104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2024] [Revised: 07/13/2024] [Accepted: 08/10/2024] [Indexed: 08/28/2024]
Abstract
The availability of high-throughput sequencing technologies increased our understanding of different genomes. However, the genomes of all living organisms still have many unidentified coding sequences. The large number of missing small open reading frames (sORFs) is due to the length threshold used in most gene identification tools, which applies in the genic and, more importantly and surprisingly, in the intergenic regions. Scanning the cucumber genome intergenic regions revealed 420 723 sORFs. We excluded 3850 sORFs with similarities to annotated cucumber proteins. To propose the functionality of the remaining 416 873 sORFs, we calculated their codon adaptation index (CAI). We found 398 937 novel sORFs (nsORFs) with CAI ≥ 0.7 that were further used for downstream analysis. Searching against the Rfam database revealed 109 nsORFs similar to multiple RNA families. Using SignalP-5.0 and NLS, we identified 11 592 signal peptides. Five predicted proteins interacting with Meloidogyne incognita and powdery mildew proteins were selected using published transcriptome data of host-pathogen interactions. Gene ontology enrichment interpreted the function of those proteins, illustrating that nsORFs' expression could contribute to the cucumber's response to biotic and abiotic stresses. This research highlights the importance of previously overlooked nsORFs in the cucumber genome and provides novel insights into their potential functions.
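As a rough illustration of the codon adaptation index (CAI) filter mentioned in this abstract, the sketch below computes CAI in its standard form, the geometric mean of per-codon relative adaptiveness values. The codon families, the reference counts, and the function names are hypothetical and cover only a toy subset of the genetic code; the reference set and software the authors actually used are not stated in the abstract.

```python
import math

# Toy synonymous-codon families (hypothetical subset of the genetic code).
# A real calculation would cover every amino acid with more than one codon.
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
}

def relative_adaptiveness(ref_counts):
    """w(codon) = count of codon / count of the most used synonymous codon."""
    w = {}
    for codons in SYNONYMS.values():
        best = max(ref_counts.get(c, 0) for c in codons)
        if best == 0:
            continue  # family never observed in the reference set
        for c in codons:
            w[c] = ref_counts.get(c, 0) / best
    return w

def cai(seq, w):
    """Geometric mean of w over the codons of seq that have a nonzero weight."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(w[c]) for c in codons if w.get(c, 0) > 0]
    if not logs:
        return 0.0
    return math.exp(sum(logs) / len(logs))

if __name__ == "__main__":
    # Hypothetical codon counts from a set of highly expressed reference genes.
    ref = {"AAA": 10, "AAG": 20, "GAA": 30, "GAG": 30, "TTT": 5, "TTC": 20}
    weights = relative_adaptiveness(ref)
    for orf in ("AAGGAATTC", "AAAGAATTT"):
        score = cai(orf, weights)
        print(orf, round(score, 3), "keep" if score >= 0.7 else "drop")
```

The 0.7 cutoff in the usage lines mirrors the threshold quoted in the abstract; everything else is illustrative.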
Affiliation(s)
- Esraa M Ahmad
- Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
- Ahmed Abdelsamad
- Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
- Hattem M El-Shabrawi
- Plant Biotechnology Department, Genetic Engineering & Biotechnology Division, National Research Center, Giza, Egypt
- Mohammed A M Aly
- Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
- Mohamed El-Soda
- Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
3
Wattanapornprom W, Thammarongtham C, Hongsthong A, Lertampaiporn S. Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization. Life (Basel) 2021; 11:life11040293. [PMID: 33808227 PMCID: PMC8066735 DOI: 10.3390/life11040293] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/29/2021] [Revised: 03/16/2021] [Accepted: 03/25/2021] [Indexed: 12/17/2022]
Abstract
The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy and reliability. The prediction of plant protein subcellular localization is challenging because the underlying problem is not only a multiclass, but also a multilabel problem. Generally, plant proteins can be found in 10–14 locations/compartments. The number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than that in other compartments (vacuole, peroxisome, Golgi, and cell wall). As a result, the problem of imbalanced data usually arises. We therefore propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization to form a total of 479 feature spaces. Then, feature selection methods were used to reduce the dimensions of the features into smaller informative feature subsets. This reduced feature subset was then used to train three different individual models. The outputs of these three classifiers were combined by average voting to return the final probability prediction. The method could predict subcellular localizations in both single- and multilabel locations, based on the voting probability. Experimental results indicated that the proposed ensemble method could achieve correct classification with an overall accuracy of 84.58% for 11 compartments, on the basis of the testing dataset.
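The average-voting step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the compartment names, the probability values, the 0.5 decision threshold, and the fallback to the single best label are all assumptions made for the example.

```python
def average_vote(model_probs, threshold=0.5):
    """Average per-compartment probabilities across models.

    model_probs: list of dicts mapping compartment -> probability,
                 one dict per classifier (all with the same keys).
    threshold:   hypothetical cutoff above which a label is reported.
    Returns (averaged probabilities, list of predicted labels).
    """
    n_models = len(model_probs)
    labels = list(model_probs[0])
    avg = {lab: sum(p[lab] for p in model_probs) / n_models for lab in labels}
    predicted = [lab for lab in labels if avg[lab] >= threshold]
    if not predicted:
        # Assumption: always return at least the most probable compartment.
        predicted = [max(avg, key=avg.get)]
    return avg, predicted

if __name__ == "__main__":
    # Hypothetical outputs of three heterogeneous classifiers for one protein.
    outputs = [
        {"nucleus": 0.9, "cytoplasm": 0.6, "vacuole": 0.1},
        {"nucleus": 0.8, "cytoplasm": 0.5, "vacuole": 0.0},
        {"nucleus": 0.7, "cytoplasm": 0.7, "vacuole": 0.2},
    ]
    avg, predicted = average_vote(outputs)
    print(avg, predicted)
```

Because every label whose averaged probability clears the threshold is reported, the same routine covers both single-label and multilabel proteins.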
Affiliation(s)
- Warin Wattanapornprom
- Applied Computer Science Program, Department of Mathematics, Faculty of Science, King Mongkut’s University of Technology Thonburi, Bangkok 10140, Thailand
- Chinae Thammarongtham
- Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Apiradee Hongsthong
- Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Supatcha Lertampaiporn
- Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
4
Kennedy K, Cal R, Casey R, Lopez C, Adelfio A, Molloy B, Wall AM, Holton TA, Khaldi N. The anti-ageing effects of a natural peptide discovered by artificial intelligence. Int J Cosmet Sci 2020; 42:388-398. [PMID: 32453870 DOI: 10.1111/ics.12635] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/20/2020] [Revised: 04/08/2020] [Accepted: 05/20/2020] [Indexed: 01/03/2023]
Abstract
OBJECTIVE As skin ages, impaired extracellular matrix (ECM) protein synthesis and increased action of degradative enzymes manifest as atrophy, wrinkling and laxity. There is mounting evidence for the functional role of exogenous peptides across many areas, including in offsetting the effects of cutaneous ageing. Here, using an artificial intelligence (AI) approach, we identified peptide RTE62G (pep_RTE62G), a naturally occurring, unmodified peptide with ECM stimulatory properties. The AI-predicted anti-ageing properties of pep_RTE62G were then validated through in vitro, ex vivo and proof of concept clinical testing.
METHODS A deep learning approach was applied to unlock pep_RTE62G from a plant source, Pisum sativum (pea). Cell culture assays of human dermal fibroblasts (HDFs) and keratinocytes (HaCaTs) were subsequently used to evaluate the in vitro effect of pep_RTE62G. Distinct activities such as cell proliferation and ECM protein production properties were determined by ELISA assays. Cell migration was assessed using a wound healing assay, while ECM protein synthesis and gene expression were analysed, respectively, by immunofluorescence microscopy and PCR. Immunohistochemistry of human skin explants was employed to further investigate the induction of ECM proteins by pep_RTE62G ex vivo. Finally, the clinical effect of pep_RTE62G was evaluated in a proof of concept 28-day pilot study.
RESULTS In vitro testing confirmed that pep_RTE62G is an effective multi-functional anti-ageing ingredient. In HaCaTs, pep_RTE62G treatment significantly increased both cellular proliferation and migration. Similarly, in HDFs, pep_RTE62G consistently induced the neosynthesis of the ECM proteins elastin and collagen, effects that were upheld in human skin explants. Lastly, in our proof of concept clinical study, application of pep_RTE62G over 28 days demonstrated anti-wrinkle and collagen stimulatory potential.
CONCLUSION pep_RTE62G represents a natural, unmodified peptide with AI-predicted and experimentally validated anti-ageing properties. Our results affirm the utility of AI in the discovery of novel, functional topical ingredients.
Affiliation(s)
- K Kennedy
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- R Cal
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- R Casey
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- C Lopez
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- A Adelfio
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- B Molloy
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- A M Wall
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- T A Holton
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
- N Khaldi
- Nuritas Ltd, Joshua Dawson House, Dawson St, Dublin 2, D02 RY95, Ireland
5
Bouziane H, Chouarfia A. Use of Chou's 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment. J Integr Bioinform 2020; 18:51-79. [PMID: 32598314 PMCID: PMC8035964 DOI: 10.1515/jib-2019-0091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/08/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022]
Abstract
To date, many proteins generated by large-scale genome sequencing projects are still uncharacterized and subject to intensive investigations by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for protein function elucidation. However, it remains a challenging task, especially for multi-site proteins known to shuttle between cell compartments to perform their proper biological functions, and for proteins that do not have significant homology to proteins of known subcellular locations. Due to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context, with the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem using different protein sequence features pertaining to subcellular localization; however, the overwhelming majority of them focus on single localization and cover very limited cellular locations. Prediction was basically established on sorting signals, amino acid compositions, and homology. To improve the prediction quality, the focus is currently on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotations, which have recently become widely adopted and essential information for learning systems. To address this problem, in the present study we considered the SCL prediction task as a multi-label learning problem and tried to label both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both GO terms of protein homologs and PSI-BLAST profiles.
The experiments using 5-fold cross-validation tests on the benchmark datasets showed a significant improvement in the results obtained by the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
Affiliation(s)
- Hafida Bouziane
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
- Abdallah Chouarfia
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
6
Sahu SS, Loaiza CD, Kaundal R. Plant-mSubP: a computational framework for the prediction of single- and multi-target protein subcellular localization using integrated machine-learning approaches. AoB Plants 2020; 12:plz068. [PMID: 32528639 PMCID: PMC7274489 DOI: 10.1093/aobpla/plz068] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 04/28/2019] [Accepted: 10/11/2019] [Indexed: 05/18/2023]
Abstract
The subcellular localization of proteins is very important for characterizing their function in a cell. Accurate computational prediction of subcellular locations has been an active area of interest. Most of the work has focused on single localization prediction. Only a few studies have discussed multi-target localization, and they have not achieved good accuracy so far; in plant sciences, very limited work has been done. Here we report the development of a novel tool, Plant-mSubP, which is based on integrated machine learning approaches to efficiently predict subcellular localizations in plant proteomes. The proposed approach predicts with high accuracy 11 single localizations and three dual locations of the plant cell. Several features based on the composition and physicochemical properties of a protein, such as amino acid composition, pseudo amino acid composition, auto-correlation descriptors, quasi-sequence-order descriptors and hybrid features, are used to represent the protein. The performance of the proposed method has been assessed through a training set as well as an independent test set. Using the hybrid feature of the pseudo amino acid composition, N-Center-C terminal amino acid composition and the dipeptide composition (PseAAC-NCC-DIPEP), overall accuracies of 81.97 %, 84.75 % and 87.88 % are achieved on the training data sets of proteins containing the single-label, single- and dual-label combined, and dual-label proteins, respectively. When tested on the independent data, accuracies of 64.36 %, 64.84 % and 81.08 % are achieved on the single-label, single- and dual-label, and dual-label proteins, respectively. The prediction models have been implemented on a web server available at http://bioinfo.usu.edu/Plant-mSubP/. The results indicate that the proposed approach is comparable to the existing methods in single localization prediction and outperforms all other existing tools when compared for dual-label proteins. The prediction tool will be a useful resource for better annotation of various plant proteomes.
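To make the composition-based features named in this abstract more concrete, the sketch below computes two of them, amino acid composition (20 values) and dipeptide composition (400 values), and concatenates them into one hybrid vector. It is a simplified illustration under stated assumptions: the function names are hypothetical, non-standard residues are simply dropped, and the pseudo amino acid and N-Center-C terminal components of the published PseAAC-NCC-DIPEP feature are not reproduced.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def clean(seq):
    """Upper-case the sequence and drop non-standard residues.

    Assumption: dropping residues such as X joins their neighbours,
    which is acceptable for a sketch but not for production use.
    """
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)

def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues (20-dimensional)."""
    seq = clean(seq)
    n = len(seq) or 1  # avoid division by zero for empty input
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each overlapping residue pair (400-dimensional)."""
    seq = clean(seq)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs) or 1
    return [pairs.count(d) / n for d in DIPEPTIDES]

def hybrid_features(seq):
    """Concatenate the two descriptors into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)

if __name__ == "__main__":
    vec = hybrid_features("MKTAYIAKQR")  # hypothetical peptide
    print(len(vec))
```

A vector built this way has a fixed length regardless of protein length, which is what allows standard classifiers to be trained on it.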
Affiliation(s)
- Sitanshu S Sahu
- Department of Electronics and Communication Engineering, Birla Institute of Technology, Mesra, Ranchi, India
- Cristian D Loaiza
- Department of Plants, Soils, and Climate/Center for Integrated BioSystems, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
- Rakesh Kaundal
- Department of Plants, Soils, and Climate/Center for Integrated BioSystems, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
7
Kunze M. Predicting Peroxisomal Targeting Signals to Elucidate the Peroxisomal Proteome of Mammals. Subcell Biochem 2018; 89:157-199. [PMID: 30378023 DOI: 10.1007/978-981-13-2233-4_7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 12/15/2022]
Abstract
Peroxisomes harbor a plethora of proteins, but the peroxisomal proteome as the entirety of all peroxisomal proteins is still unknown for mammalian species. Computational algorithms can be used to predict the subcellular localization of proteins based on their amino acid sequence, and this method has been amply used to forecast the intracellular fate of individual proteins. Moreover, when such algorithms are applied systematically to all proteins of an organism, the prediction of its peroxisomal proteome in silico should be possible. Therefore, reliable detection of peroxisomal targeting signals (PTS), which act as postal codes for the intracellular distribution of the encoded protein, is crucial. Peroxisomal proteins can utilize different routes to reach their destination depending on the type of PTS. Accordingly, independent prediction algorithms have been developed for each type of PTS, but only those for type-1 motifs (PTS1) have so far reached a satisfying predictive performance. This is partially due to the low number of peroxisomal proteins, which limits the power of statistical analyses, and partially due to specific properties of peroxisomal protein import, which render functional PTS motifs inactive in specific contexts. Moreover, the prediction of the peroxisomal proteome is limited by the high number of proteins encoded in mammalian genomes, which causes numerous false positive predictions even when using reliable algorithms and buries the few yet unidentified peroxisomal proteins. Thus, the application of prediction algorithms to identify all peroxisomal proteins is currently ineffective as a stand-alone method, but can display its full potential when combined with other methods.
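A very reduced picture of type-1 signal detection is a pattern match on the last three residues of a protein. The sketch below uses the commonly cited consensus (S/A/C)-(K/R/H)-(L/M) at the extreme C-terminus, with SKL as the prototype. This is an assumption brought in for illustration and is not the algorithm of any predictor discussed in the chapter; real PTS1 predictors also score upstream residues and sequence context, which is exactly why a bare motif scan produces the false positives the abstract warns about. The function name and example sequences are hypothetical.

```python
import re

# Simplified textbook consensus for the C-terminal PTS1 tripeptide.
PTS1_PATTERN = re.compile(r"[SAC][KRH][LM]$")

def has_pts1_like_terminus(protein_seq):
    """Return True if the sequence ends in a consensus-like PTS1 tripeptide."""
    return PTS1_PATTERN.search(protein_seq.strip().upper()) is not None

def scan_proteome(proteins):
    """Return the names of proteins whose C-terminus matches the consensus.

    proteins: dict mapping a protein name to its amino acid sequence.
    """
    return [name for name, seq in proteins.items()
            if has_pts1_like_terminus(seq)]

if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    toy = {
        "cand_1": "MAAGTPLQSKL",   # ends in SKL
        "cand_2": "MSKLAAGTPLQ",   # SKL present but not at the C-terminus
        "cand_3": "MGGTTPVARM",    # ends in ARM, matches the loose consensus
    }
    print(scan_proteome(toy))
```

Applied to a whole mammalian proteome, a scan like this would flag many non-peroxisomal proteins, so its output is best treated as a candidate list for further filtering.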
Affiliation(s)
- Markus Kunze
- Department of Pathobiology of the Nervous System, Center for Brain Research, Medical University of Vienna, Vienna, Austria.
8
Volpato V, Alshomrani B, Pollastri G. Accurate Ab Initio and Template-Based Prediction of Short Intrinsically-Disordered Regions by Bidirectional Recurrent Neural Networks Trained on Large-Scale Datasets. Int J Mol Sci 2015; 16:19868-85. [PMID: 26307973 PMCID: PMC4581330 DOI: 10.3390/ijms160819868] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/01/2015] [Revised: 07/28/2015] [Accepted: 07/29/2015] [Indexed: 12/02/2022]
Abstract
Intrinsically-disordered regions lack a well-defined 3D structure, but play key roles in determining the function of many proteins. Although predictors of disorder have been shown to achieve relatively high rates of correct classification of these segments, improvements over the years have been slow, and accurate methods are needed that are capable of accommodating the ever-increasing amount of structurally-determined protein sequences to try to boost predictive performance. In this paper, we propose a predictor for short disordered regions based on bidirectional recurrent neural networks and tested by rigorous five-fold cross-validation on a large, non-redundant dataset collected from MobiDB, a new comprehensive source of protein disorder annotations. The system exploits sequence and structural information in the form of frequency profiles, predicted secondary structure and solvent accessibility, and direct disorder annotations from homologous protein structures (templates) deposited in the Protein Data Bank. The contributions of sequence, structure and homology information result in large improvements in predictive accuracy. Additionally, the large scale of the training set leads to low false positive rates, making our systems a robust and efficient way to address high-throughput disorder prediction.
Affiliation(s)
- Viola Volpato
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland.
- Adaptive and Complex Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland.
- Badr Alshomrani
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland.
- Adaptive and Complex Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland.
- Gianluca Pollastri
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland.
- Adaptive and Complex Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland.
9
Bu Y, Zhao M, Sun B, Zhang X, Takano T, Liu S. An efficient method for stable protein targeting in grasses (Poaceae): a case study in Puccinellia tenuiflora. BMC Biotechnol 2014; 14:52. [PMID: 24898217 PMCID: PMC4064272 DOI: 10.1186/1472-6750-14-52] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/12/2014] [Accepted: 05/26/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND An efficient transformation method is lacking for most non-model plant species to test gene function. Therefore, subcellular localization of proteins of interest from non-model plants is mainly carried out through transient transformation in homologous cells or in heterologous cells from model species such as Arabidopsis. Although analysis of expression patterns in model organisms like yeast and Arabidopsis can provide important clues about protein localization, these heterologous systems may not always faithfully reflect the native subcellular distribution in other species. On the other hand, transient expression in protoplasts from species of interest has limited ability for detailed sub-cellular localization analysis (e.g., those involving subcellular fractionation or sectioning and immunodetection), as it results in heterogeneous populations comprised of both transformed and untransformed cells.
RESULTS We have developed a simple and reliable method for stable transformation of plant cell suspensions that are suitable for protein subcellular localization analyses in the non-model monocotyledonous plant Puccinellia tenuiflora. Optimization of protocols for obtaining suspension-cultured cells followed by Agrobacterium-mediated genetic transformation allowed us to establish stably transformed cell lines, which could be maintained indefinitely in axenic culture supplied with the proper antibiotic. As a case study, protoplasts of transgenic cell lines stably transformed with an ammonium transporter-green fluorescent protein (PutAMT1;1-GFP) fusion were successfully used for subcellular localization analyses in P. tenuiflora.
CONCLUSIONS We present a reliable method for the generation of stably transformed P. tenuiflora cell lines, which, being available in virtually unlimited amounts, can be conveniently used for any type of protein subcellular localization analysis required. Given its simplicity, the method can be used as reference for other non-model plant species lacking efficient regeneration protocols.
Affiliation(s)
- Shenkui Liu
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration in Oil Field (SAVER), Ministry of Education, Alkali Soil Natural Environmental Science Center (ASNESC), Northeast Forestry University, Hexing Road No. 26, Xiangfang District, Harbin City, Heilongjiang Province 150040, China.
10
HybridGO-Loc: mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins. PLoS One 2014; 9:e89545. [PMID: 24647341 PMCID: PMC3960097 DOI: 10.1371/journal.pone.0089545] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 11/11/2013] [Accepted: 01/23/2014] [Indexed: 12/23/2022]
Abstract
Protein subcellular localization prediction, as an essential step to elucidate the functions of proteins in vivo and identify drug targets, has been extensively studied in previous decades. Instead of only determining the subcellular localization of single-label proteins, recent studies have focused on predicting both single- and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the GO term occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequency of GO occurrences and the semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors. For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.
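The idea of fusing a GO frequency vector with a semantic similarity vector can be sketched as below. The abstract does not specify how similarity is measured or how the two vectors are hybridized, so this example makes several assumptions that should not be read as the HybridGO-Loc method: similarity comes from a hand-written toy lookup table, each similarity entry is the best match between a vocabulary term and the protein's retrieved terms, and fusion is plain concatenation. All function names are hypothetical.

```python
from collections import Counter

def frequency_vector(terms, vocab):
    """Count how often each vocabulary GO term occurs among retrieved terms."""
    counts = Counter(terms)
    return [counts[t] for t in vocab]

def similarity_vector(terms, vocab, sim):
    """For each vocabulary term, keep its best similarity to any retrieved term."""
    unique = set(terms)
    return [max((sim(v, t) for t in unique), default=0.0) for v in vocab]

def fusion_vector(terms, vocab, sim):
    """Assumed fusion rule: concatenate frequency and similarity vectors."""
    return frequency_vector(terms, vocab) + similarity_vector(terms, vocab, sim)

# Toy similarity table (made-up value, for illustration only).
TOY_SIM = {frozenset({"GO:0005737", "GO:0005739"}): 0.4}

def toy_sim(a, b):
    """Identical terms score 1.0; other pairs use the toy table, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset({a, b}), 0.0)

if __name__ == "__main__":
    # GO:0005634 nucleus, GO:0005737 cytoplasm, GO:0005739 mitochondrion
    vocab = ["GO:0005634", "GO:0005737", "GO:0005739"]
    # Terms retrieved for one protein via its homologs (hypothetical).
    retrieved = ["GO:0005634", "GO:0005634", "GO:0005737"]
    print(fusion_vector(retrieved, vocab, toy_sim))
```

The similarity half of the vector gives a nonzero value to the mitochondrion term even though it was never retrieved, which is the kind of inter-term relationship that a pure occurrence count would miss.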