1
Jiang Y, Wang D, Wang W, Xu D. Computational methods for protein localization prediction. Comput Struct Biotechnol J 2021; 19:5834-5844. [PMID: 34765098 PMCID: PMC8564054 DOI: 10.1016/j.csbj.2021.10.023] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022] Open
Abstract
The accurate annotation of protein localization is crucial in understanding protein function in tandem with a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally-determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.
Affiliation(s)
- Yuexu Jiang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Duolin Wang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Weiwei Wang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Dong Xu
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
2
Lu AX, Chong YT, Hsu IS, Strome B, Handfield LF, Kraus O, Andrews BJ, Moses AM. Integrating images from multiple microscopy screens reveals diverse patterns of change in the subcellular localization of proteins. eLife 2018; 7:e31872. [PMID: 29620521 PMCID: PMC5935485 DOI: 10.7554/elife.31872] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 09/10/2017] [Accepted: 03/30/2018] [Indexed: 01/29/2023] Open
Abstract
The evaluation of protein localization changes on a systematic level is a powerful tool for understanding how cells respond to environmental, chemical, or genetic perturbations. To date, work in understanding these proteomic responses through high-throughput imaging has catalogued localization changes independently for each perturbation. To distinguish changes that are targeted responses to the specific perturbation or more generalized programs, we developed a scalable approach to visualize the localization behavior of proteins across multiple experiments as a quantitative pattern. By applying this approach to 24 experimental screens consisting of nearly 400,000 images, we differentiated specific responses from more generalized ones, discovered nuance in the localization behavior of stress-responsive proteins, and formed hypotheses by clustering proteins that have similar patterns. Previous approaches aim to capture all localization changes for a single screen as accurately as possible, whereas our work aims to integrate large amounts of imaging data to find unexpected new cell biology.
Affiliation(s)
- Alex X Lu
- Department of Computer Science, University of Toronto, Toronto, Canada
- Yolanda T Chong
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Ian Shen Hsu
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
- Bob Strome
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
- Oren Kraus
- Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
- Brenda J Andrews
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- Alan M Moses
- Department of Computer Science, University of Toronto, Toronto, Canada
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
- Center for Analysis of Genome Evolution and Function, University of Toronto, Toronto, Canada
3
Liu Z, Hu J. Mislocalization-related disease gene discovery using gene expression based computational protein localization prediction. Methods 2015; 93:119-27. [PMID: 26416496 DOI: 10.1016/j.ymeth.2015.09.022] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 06/02/2015] [Revised: 09/17/2015] [Accepted: 09/21/2015] [Indexed: 01/09/2023] Open
Abstract
Protein sorting is an important mechanism for transporting proteins to their target subcellular locations after their synthesis. Mutations in genes may disrupt the well-regulated protein sorting process, leading to a variety of mislocation-related diseases. This paper proposes a methodology to discover such disease genes based on gene expression data and computational protein localization prediction. A kernel logistic regression based algorithm is used to successfully identify several candidate cancer genes which may cause cancers due to their mislocation within the cell. Our results also showed that, compared to the gene co-expression network defined on Pearson correlation coefficients, the nonlinear Maximum Correlation Coefficients (MIC) based co-expression network gives better results for subcellular localization prediction.
Affiliation(s)
- Zhonghao Liu
- Department of Computer Science & Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, United States
- Jianjun Hu
- Department of Computer Science & Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, United States
4
Wang X, Zhang J, Li GZ. Multi-location gram-positive and gram-negative bacterial protein subcellular localization using gene ontology and multi-label classifier ensemble. BMC Bioinformatics 2015; 16 Suppl 12:S1. [PMID: 26329681 PMCID: PMC4705491 DOI: 10.1186/1471-2105-16-s12-s1] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 01/06/2023] Open
Abstract
Background: Predicting bacterial protein subcellular locations with computational methods has become an important and challenging task. Although many prediction methods exist for bacterial proteins, most of them can only deal with single-location proteins. However, many bacterial proteins reside in multiple locations. Moreover, multi-location proteins have special biological functions that can help the development of new drugs. It is therefore necessary to develop new computational methods for accurately predicting the subcellular locations of multi-location bacterial proteins. Results: In this article, two efficient multi-label predictors, Gpos-ECC-mPLoc and Gneg-ECC-mPLoc, are developed to predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. The two multi-label predictors construct the GO vectors by using the GO terms of homologous proteins of query proteins and then adopt a powerful multi-label ensemble classifier to make the final multi-label prediction. The two multi-label predictors have the following advantages: (1) they improve the prediction performance of multi-label proteins by taking the correlations among different labels into account; (2) they ensemble multiple CC classifiers and further generate better prediction results by ensemble learning; and (3) they construct the GO vectors by using the frequency of occurrences of GO terms in the typical homologous set instead of using 0/1 values. Experimental results show that Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. Conclusions: Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently improve prediction accuracy of subcellular localization of multi-location gram-positive and gram-negative bacterial proteins respectively. The online web servers for Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors are freely accessible at http://biomed.zzuli.edu.cn/bioinfo/gpos-ecc-mploc/ and http://biomed.zzuli.edu.cn/bioinfo/gneg-ecc-mploc/ respectively.
5
Abstract
Protein location and function can change dynamically depending on many factors, including environmental stress, disease state, age, developmental stage, and cell type. Here, we describe an integrative computational framework, called the conditional function predictor (CoFP; http://nbm.ajou.ac.kr/cofp/), for predicting changes in subcellular location and function on a proteome-wide scale. The essence of the CoFP approach is to cross-reference general knowledge about a protein and its known network of physical interactions, which typically pool measurements from diverse environments, against gene expression profiles that have been measured under specific conditions of interest. Using CoFP, we predict condition-specific subcellular locations, biological processes, and molecular functions of the yeast proteome under 18 specified conditions. In addition to highly accurate retrieval of previously known gold standard protein locations and functions, CoFP predicts previously unidentified condition-dependent locations and functions for nearly all yeast proteins. Many of these predictions can be confirmed using high-resolution cellular imaging. We show that, under DNA-damaging conditions, Tsr1, Caf120, Dip5, Skg6, Lte1, and Nnf2 change subcellular location and RNA polymerase I subunit A43, Ino2, and Ids2 show changes in DNA binding. Beyond specific predictions, this work reveals a global landscape of changing protein location and function, highlighting a surprising number of proteins that translocate from the mitochondria to the nucleus or from endoplasmic reticulum to Golgi apparatus under stress.
6
Zhang SW, Liu YF, Yu Y, Zhang TH, Fan XN. MSLoc-DT: A new method for predicting the protein subcellular location of multispecies based on decision templates. Anal Biochem 2014; 449:164-71. [DOI: 10.1016/j.ab.2013.12.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 07/21/2013] [Revised: 11/08/2013] [Accepted: 12/12/2013] [Indexed: 12/12/2022]
7
Lee K, Byun K, Hong W, Chuang HY, Pack CG, Bayarsaikhan E, Paek SH, Kim H, Shin HY, Ideker T, Lee B. Proteome-wide discovery of mislocated proteins in cancer. Genome Res 2013; 23:1283-94. [PMID: 23674306 PMCID: PMC3730102 DOI: 10.1101/gr.155499.113] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Indexed: 11/25/2022]
Abstract
Several studies have sought systematically to identify protein subcellular locations, but an even larger task is to map which of these proteins conditionally relocates in disease (the mislocalizome). Here, we report an integrative computational framework for mapping conditional location and mislocation of proteins on a proteome-wide scale, called a conditional location predictor (CoLP). Using CoLP, we mapped the locations of over 10,000 proteins in normal human brain and in glioma. The prediction showed 0.9 accuracy using 100 location tests of 20 randomly selected proteins. Of the 10,000 proteins, over 150 have a strong likelihood of mislocation under glioma, which is striking considering that few mislocation events have been identified in this disease previously. Using immunofluorescence and Western blotting in both primary cells and tissues, we successfully experimentally confirmed 15 mislocations. The most common type of mislocation occurs between the endoplasmic reticulum and the nucleus; for example, for RNF138, TLX3, and NFRKB. In particular, we found that the gene for the mislocating protein GFRA4 had a nonsynonymous point mutation in exon 2. Moreover, redirection of GFRA4 to its normal location, the plasma membrane, led to marked reductions in phospho-STAT3 and proliferation of glioma cells. This framework has the potential to track changes in protein location in many human diseases.
Affiliation(s)
- KiYoung Lee
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea.
8
Wan S, Mak MW, Kung SY. GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition. J Theor Biol 2013; 323:40-8. [PMID: 23376577 DOI: 10.1016/j.jtbi.2013.01.012] [Citation(s) in RCA: 82] [Impact Index Per Article: 7.5] [Received: 09/07/2012] [Revised: 01/16/2013] [Accepted: 01/16/2013] [Indexed: 01/03/2023]
Abstract
Prediction of protein subcellular localization is an important yet challenging problem. Recently, several computational methods based on Gene Ontology (GO) have been proposed to tackle this problem and have demonstrated superiority over methods based on other features. Existing GO-based methods, however, do not fully use the GO information. This paper proposes an efficient GO method called GOASVM that exploits the information from the GO term frequencies and distant homologs to represent a protein in the general form of Chou's pseudo-amino acid composition. The method first selects a subset of relevant GO terms to form a GO vector space. Then for each protein, the method uses the accession number (AC) of the protein or the ACs of its homologs to find the number of occurrences of the selected GO terms in the Gene Ontology annotation (GOA) database as a means to construct GO vectors for support vector machines (SVMs) classification. With the advantages of GO term frequencies and a new strategy to incorporate useful homologous information, GOASVM can achieve a prediction accuracy of 72.2% on a new independent test set comprising novel proteins that were added to Swiss-Prot six years later than the creation date of the training set. GOASVM and Supplementary materials are available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China.
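The GOASVM abstract above describes representing a protein by how often selected GO terms occur among the annotations of the protein and its homologs, rather than by 0/1 presence flags. The following is a minimal sketch of that idea only; the function name `go_frequency_vector`, the example GO identifiers, and the tiny vocabulary are illustrative assumptions and are not taken from the GOASVM implementation.

```python
from collections import Counter


def go_frequency_vector(go_annotations, vocabulary):
    """Count how often each vocabulary GO term occurs in the annotations.

    go_annotations: GO term IDs collected for a protein and its homologs
                    (hypothetical input; duplicates are meaningful).
    vocabulary:     ordered list of selected GO terms spanning the vector space.
    """
    counts = Counter(go_annotations)
    # Terms outside the vocabulary are ignored; missing terms count as 0.
    return [counts[term] for term in vocabulary]


# Hypothetical GO terms retrieved for one query protein and its homologs.
annotations = ["GO:0005634", "GO:0005737", "GO:0005634", "GO:0016020"]
# Hypothetical subset of relevant GO terms defining the vector space.
vocabulary = ["GO:0005634", "GO:0005737", "GO:0005739"]

vector = go_frequency_vector(annotations, vocabulary)
# The 0/1 alternative keeps only presence or absence of each term.
binary = [1 if count > 0 else 0 for count in vector]

print(vector)
print(binary)
```

In a full pipeline, vectors like `vector` would be passed to a classifier such as an SVM; that step is omitted here because the abstract does not specify kernel or parameter choices.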
9
Wan S, Mak MW, Kung SY. mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines. BMC Bioinformatics 2012; 13:290. [PMID: 23130999 PMCID: PMC3582598 DOI: 10.1186/1471-2105-13-290] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Received: 07/03/2012] [Accepted: 10/24/2012] [Indexed: 12/21/2022] Open
Abstract
Background: Although many computational methods have been developed to predict protein subcellular localization, most of the methods are limited to the prediction of single-location proteins. Multi-location proteins are either not considered or assumed not to exist. However, proteins with multiple locations are particularly interesting because they may have special biological functions, which are essential to both basic research and drug discovery. Results: This paper proposes an efficient multi-label predictor, namely mGOASVM, for predicting the subcellular localization of multi-location proteins. Given a protein, the accession numbers of its homologs are obtained via BLAST search. Then, the original accession number and the homologous accession numbers of the protein are used as keys to search against the Gene Ontology (GO) annotation database to obtain a set of GO terms. Given a set of training proteins, a set of T relevant GO terms is obtained by finding all of the GO terms in the GO annotation database that are relevant to the training proteins. These relevant GO terms then form the basis of a T-dimensional Euclidean space on which the GO vectors lie. A support vector machine (SVM) classifier with a new decision scheme is proposed to classify the multi-label GO vectors. The mGOASVM predictor has the following advantages: (1) it uses the frequency of occurrences of GO terms for feature representation; (2) it selects the relevant GO subspace which can substantially speed up the prediction without compromising performance; and (3) it adopts an efficient multi-label SVM classifier which significantly outperforms other predictors. Briefly, on two recently published virus and plant datasets, mGOASVM achieves an actual accuracy of 88.9% and 87.4%, respectively, which are significantly higher than those achieved by the state-of-the-art predictors such as iLoc-Virus (74.8%) and iLoc-Plant (68.1%). Conclusions: mGOASVM can efficiently predict the subcellular locations of multi-label proteins. The mGOASVM predictor is available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/mGOASVM.html.
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
10
He J, Gu H, Liu W. Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites. PLoS One 2012; 7:e37155. [PMID: 22715364 PMCID: PMC3371015 DOI: 10.1371/journal.pone.0037155] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 11/24/2011] [Accepted: 04/14/2012] [Indexed: 12/20/2022] Open
Abstract
It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them have focused on proteins with only one location. In recent years, researchers have begun to pay attention to the subcellular localization prediction of proteins with multiple sites. However, almost all the existing approaches have failed to take into account the correlations among the locations caused by proteins with multiple sites, which may be important information for improving the prediction accuracy for such proteins. In this paper, a new algorithm that can effectively exploit the correlations among the locations is proposed, using a Gaussian process model. In addition, the algorithm can realize an optimal linear combination of various feature extraction technologies and can be robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches.
Affiliation(s)
- Jianjun He
- School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Hong Gu
- School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Wenqi Liu
- School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning, China
11
Wu ZC, Xiao X, Chou KC. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Mol Biosyst 2011; 7:3287-97. [PMID: 21984117 DOI: 10.1039/c1mb05232b] [Citation(s) in RCA: 181] [Impact Index Per Article: 13.9] [Indexed: 11/21/2022]
Affiliation(s)
- Zhi-Cheng Wu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
12
Jung J, Ryu T, Hwang Y, Lee E, Lee D. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. J Comput Biol 2010; 17:97-105. [PMID: 20078400 DOI: 10.1089/cmb.2008.0236] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022] Open
Abstract
Extracellular matrix (ECM) proteins are secreted to the exterior of the cell, and function as mediators between resident cells and the external environment. These proteins not only support cellular structure but also participate in diverse processes, including growth, hormonal response, homeostasis, and disease progression. Despite their importance, current knowledge of the number and functions of ECM proteins is limited. Here, we propose a computational method to predict ECM proteins. Specific features, such as ECM domain score and repetitive residues, were utilized for prediction. Based on previously employed and newly generated features, discriminatory characteristics for ECM protein categorization were determined, which significantly improved the performance of Random Forest and support vector machine (SVM) classification. We additionally predicted novel ECM proteins from non-annotated human proteins, validated with gene ontology and earlier literature. Our novel prediction method is available at biosoft.kaist.ac.kr/ecm.
Affiliation(s)
- Juhyun Jung
- Department of Bio and Brain Engineering, KAIST, Daejeon, Korea
13
Lin HN, Chen CT, Sung TY, Ho SY, Hsu WL. Protein subcellular localization prediction of eukaryotes using a knowledge-based approach. BMC Bioinformatics 2009; 10 Suppl 15:S8. [PMID: 19958518 PMCID: PMC2788359 DOI: 10.1186/1471-2105-10-s15-s8] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive. Thus, computational approaches become highly desirable. Most of the PSL prediction systems are established for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized into multiple subcellular organelles. Many studies have shown that proteins may simultaneously locate or move between different cellular compartments and be involved in different biological processes with different roles. RESULTS: In this study, we propose a knowledge-based method, called KnowPredsite, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on the local similarity, we can identify the "related sequences" for prediction. We construct a knowledge base to record the possible sequence variations for protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which consisted of ten distinct subcellular organelles from 1923 species, and performed ten-fold cross validation experiments to evaluate KnowPredsite's performance. The experimental results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is significantly higher than that of ngLOC by 12.4%. Notably, half of the proteins in the dataset that cannot find any Blast hit sequence above a specified threshold can still be correctly predicted by KnowPredsite. CONCLUSION: KnowPredsite demonstrates the power of identifying related sequences in the knowledge base. The experimental results show that even though the sequence similarity is low, the local similarity is effective for prediction, and that KnowPredsite is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPredsite is transparent and biologically interpretable, and that it shows the set of template sequences used to generate the prediction result. The KnowPredsite prediction server is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/.
Affiliation(s)
- Hsin-Nan Lin
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan, Republic of China.
14
A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0. Anal Biochem 2009; 394:269-74. [DOI: 10.1016/j.ab.2009.07.046] [Citation(s) in RCA: 135] [Impact Index Per Article: 9.0] [Received: 05/21/2009] [Revised: 07/28/2009] [Accepted: 07/28/2009] [Indexed: 12/12/2022]
15
Tung TQ, Lee D. A method to improve protein subcellular localization prediction by integrating various biological data sources. BMC Bioinformatics 2009; 10 Suppl 1:S43. [PMID: 19208145 PMCID: PMC2648781 DOI: 10.1186/1471-2105-10-s1-s43] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/24/2022] Open
Abstract
Background: Protein subcellular localization is crucial information for elucidating protein functions. Owing to the need for large-scale genome analysis, computational methods for efficiently predicting protein subcellular localization are highly desirable. Although much previous work has addressed this task, the problem remains challenging for several reasons: the number of subcellular locations in practice is large; the distribution of proteins across locations is imbalanced, that is, the number of proteins in each location differs remarkably; and many proteins are located in multiple locations. Thus it is necessary to explore new features and appropriate classification methods to improve the prediction performance. Results: In this paper we propose a new prediction method that combines two key ideas: 1) information on neighbour proteins in a probabilistic gene network is integrated to enrich the prediction features; 2) fuzzy k-NN, a classification method based on fuzzy set theory, is applied to predict proteins located in multiple sites. An experiment was conducted on a dataset of budding yeast proteins covering 22 locations, and significant improvement was observed. Conclusion: Our results suggest that the neighbourhood information from functional gene networks is predictive of subcellular localization. The proposed method can thus be integrated with, and is complementary to, other available prediction methods.
Affiliation(s)
- Thai Quang Tung
- Department of Bio & Brain Engineering, KAIST, Daejeon City, Republic of Korea.
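The abstract above applies fuzzy k-NN to proteins that may occupy several locations. As a rough illustration, the sketch below uses the generic distance-weighted fuzzy k-NN membership rule, in which each of the k nearest neighbours contributes its location memberships with weight 1 / d^(2/(m-1)). This is an assumption about the general technique, not the authors' exact variant; the function name `fuzzy_knn`, the toy 2-D features, the membership values, and the 0.4 reporting threshold are all hypothetical.

```python
import math


def fuzzy_knn(query, train_points, train_memberships, k=3, m=2.0, eps=1e-9):
    """Return a membership score per location for `query`.

    train_memberships[i] maps location name -> membership in [0, 1] for
    training point i. Neighbours are weighted by 1 / d**(2 / (m - 1));
    `eps` avoids division by zero when the query equals a training point.
    """
    # Sort training points by Euclidean distance and keep the k nearest.
    nearest = sorted(
        (math.dist(query, point), index)
        for index, point in enumerate(train_points)
    )[:k]
    weights = [(1.0 / (d ** (2.0 / (m - 1.0)) + eps), i) for d, i in nearest]
    total = sum(w for w, _ in weights)
    locations = {loc for mem in train_memberships for loc in mem}
    # Weighted average of the neighbours' memberships for each location.
    return {
        loc: sum(w * train_memberships[i].get(loc, 0.0) for w, i in weights) / total
        for loc in locations
    }


# Toy 2-D feature vectors and location memberships (illustrative only).
train_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_memberships = [
    {"nucleus": 1.0},
    {"nucleus": 1.0, "cytoplasm": 1.0},  # a multi-location protein
    {"mitochondrion": 1.0},
    {"mitochondrion": 1.0},
]

scores = fuzzy_knn((0.0, 0.5), train_points, train_memberships, k=3)
# Report every location whose membership passes a hypothetical threshold.
predicted = sorted(loc for loc, score in scores.items() if score >= 0.4)
print(scores)
print(predicted)
```

Because the output is a graded membership per location rather than a single label, a protein can be assigned to several sites at once by thresholding, which is what makes this family of classifiers a natural fit for multi-location prediction.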
16
Lee K, Chuang HY, Beyer A, Sung MK, Huh WK, Lee B, Ideker T. Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species. Nucleic Acids Res 2008; 36:e136. [PMID: 18836191 PMCID: PMC2582614 DOI: 10.1093/nar/gkn619] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Indexed: 11/23/2022] Open
Abstract
The function of a protein is intimately tied to its subcellular localization. Although localizations have been measured for many yeast proteins through systematic GFP fusions, similar studies in other branches of life are still forthcoming. In the interim, various machine-learning methods have been proposed to predict localization using physical characteristics of a protein, such as amino acid content, hydrophobicity, side-chain mass and domain composition. However, there has been comparatively little work on predicting localization using protein networks. Here, we predict protein localizations by integrating an extensive set of protein physical characteristics over a protein's extended protein–protein interaction neighborhood, using a classification framework called ‘Divide and Conquer k-Nearest Neighbors’ (DC-kNN). These predictions achieve significantly higher accuracy than two well-known methods for predicting protein localization in yeast. Using new GFP imaging experiments, we show that the network-based approach can extend and revise previous annotations made from high-throughput studies. Finally, we show that our approach remains highly predictive in higher eukaryotes such as fly and human, in which most localizations are unknown and the protein network coverage is less substantial.
Affiliation(s)
- Kiyoung Lee
- Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
17
An ensemble of support vector machines for predicting the membrane protein type directly from the amino acid sequence. Amino Acids 2008; 35:573-80. [DOI: 10.1007/s00726-008-0083-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 01/21/2008] [Accepted: 02/26/2008] [Indexed: 11/26/2022]
18
Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat Protoc 2008; 3:153-62. [PMID: 18274516 DOI: 10.1038/nprot.2007.494] [Citation(s) in RCA: 690] [Impact Index Per Article: 43.1] [Indexed: 12/19/2022]
19
Nanni L, Lumini A. Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization. Amino Acids 2008; 34:653-60. [PMID: 18175047 DOI: 10.1007/s00726-007-0018-1] [Citation(s) in RCA: 124] [Impact Index Per Article: 7.8] [Received: 11/26/2007] [Accepted: 12/11/2007] [Indexed: 01/25/2023]
20
Tantoso E, Li KB. AAIndexLoc: predicting subcellular localization of proteins based on a new representation of sequences using amino acid indices. Amino Acids 2007; 35:345-53. [DOI: 10.1007/s00726-007-0616-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 09/28/2007] [Accepted: 10/04/2007] [Indexed: 10/22/2022]
21
Chou KC, Shen HB. Recent progress in protein subcellular location prediction. Anal Biochem 2007; 370:1-16. [PMID: 17698024 DOI: 10.1016/j.ab.2007.07.006] [Citation(s) in RCA: 603] [Impact Index Per Article: 35.5] [Received: 05/20/2007] [Revised: 07/02/2007] [Accepted: 07/04/2007] [Indexed: 01/16/2023]
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
|
22
|
Shen HB, Yang J, Chou KC. Methodology development for predicting subcellular localization and other attributes of proteins. Expert Rev Proteomics 2007; 4:453-63. [PMID: 17705704 DOI: 10.1586/14789450.4.4.453] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
Abstract
Facing the explosion of newly generated protein sequences in the postgenomic age, we are challenged to develop computational methods for the fast and accurate identification of their subcellular localization and other attributes. This review summarizes recent methodology developments, with a focus on artificial neural networks, the statistical learning and support vector machine, the fuzzy logic-based algorithm and the evidence-theory-based algorithm, as well as the ensemble classifier approach. Meanwhile, an outline of the use of different descriptors for protein samples is given. In addition, a series of web servers established recently based on various ensemble classifiers are also briefly introduced.
Affiliation(s)
- Hong-Bin Shen
- Shanghai Jiaotong University, Institute of Image Processing & Pattern Recognition, Shanghai, China.
|
23
|
Mundra P, Kumar M, Kumar KK, Jayaraman VK, Kulkarni BD. Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM. Pattern Recognit Lett 2007. [DOI: 10.1016/j.patrec.2007.04.001] [Citation(s) in RCA: 90] [Impact Index Per Article: 5.3] [Indexed: 11/29/2022]
|
24
|
Su ECY, Chiu HS, Lo A, Hwang JK, Sung TY, Hsu WL. Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinformatics 2007; 8:330. [PMID: 17825110 PMCID: PMC2040162 DOI: 10.1186/1471-2105-8-330] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Received: 04/25/2007] [Accepted: 09/08/2007] [Indexed: 01/17/2023] Open
Abstract
Background: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins.
Results: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machines (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains a prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%.
Conclusion: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, combining the structural homology approach further improves the overall accuracy, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.
Affiliation(s)
- Emily Chia-Yu Su
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan
- Hua-Sheng Chiu
- Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Allan Lo
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
- Department of Life Sciences, National Tsing Hua University, Hsinchu, Taiwan
- Jenn-Kang Hwang
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan
- Ting-Yi Sung
- Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
- Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
25
|
Emanuelsson O, Brunak S, von Heijne G, Nielsen H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2007; 2:953-71. [PMID: 17446895 DOI: 10.1038/nprot.2007.131] [Citation(s) in RCA: 2458] [Impact Index Per Article: 144.6] [Indexed: 12/12/2022]
Abstract
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.
Affiliation(s)
- Olof Emanuelsson
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
|
26
|
Shen HB, Chou KC. Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem Biophys Res Commun 2007; 355:1006-11. [PMID: 17346678 DOI: 10.1016/j.bbrc.2007.02.071] [Citation(s) in RCA: 147] [Impact Index Per Article: 8.6] [Received: 02/03/2007] [Accepted: 02/09/2007] [Indexed: 10/23/2022]
Abstract
Proteins may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery. For instance, among the 6408 human protein entries that have experimentally observed subcellular location annotations in the Swiss-Prot database (version 50.7, released 19-Sept-2006), 973 (approximately 15%) have multiple location sites. The number of total human protein entries (except those annotated with "fragment" or those with less than 50 amino acids) in the same database is 14,370, meaning a gap of (14,370 - 6408) = 7962 entries for which no knowledge is available about their subcellular locations. Although one can use the computational approach to predict the desired information for the gap, so far all the existing methods for predicting human protein subcellular localization are limited to the case of a single location site. To overcome such a barrier, a new ensemble classifier, named Hum-mPLoc, was developed that can be used to deal with the case of multiple location sites as well. Hum-mPLoc is freely accessible to the public as a web server at http://202.120.37.186/bioinf/hum-multi. Meanwhile, for the convenience of people working in the relevant areas, Hum-mPLoc has been used to identify all human protein entries in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The large-scale results thus obtained have been deposited in a downloadable file prepared with Microsoft Excel and named "Tab_Hum-mPLoc.xls". This file is available at the same website and will be updated twice a year to include new entries of human proteins and reflect the continuous development of Hum-mPLoc.
Affiliation(s)
- Hong-Bin Shen
- Department of Biological Chemistry & Molecular Pharmacology, Harvard Medical School, Boston, MA 02115, USA.
|