1. Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its functions, the functioning of the cell, and its possible interactions with other proteins. Fast, reliable, and accurate predictors provide platforms that harness the abundance of sequence data to predict subcellular locations. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by taxonomies of prediction tools that highlight their relevant areas and give examples for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools and can serve as a quick guide for researchers to identify and explore recent advances in the literature.
Affiliation(s)
- Maryam Gillani
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
2. Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024] Open
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell will give insights into tailored drug design for a disease. As the gold validation standard, the conventional wet lab uses fluorescent microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags for protein subcellular location identification. However, the booming era of proteomics and high-throughput sequencing generates tons of newly discovered proteins, making protein subcellular localization by wet-lab experiments a mission impossible. To tackle this concern, in the past decades, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area. In this article, we review the latest advances in AI-based method development in three typical types of approaches, including sequence-based, knowledge-based, and image-based methods. We also elaborately discuss existing challenges and future directions in AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou
  - College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang
  - Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
3. Ghosh D, Cabrera J. Enriched Random Forest for High Dimensional Genomic Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2817-2828. [PMID: 34129502 PMCID: PMC9923687 DOI: 10.1109/tcbb.2021.3089417] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/27/2023]
Abstract
Ensemble methods such as random forest work well on high-dimensional datasets. However, when the number of features is extremely large compared to the number of samples and the percentage of truly informative features is very small, the performance of traditional random forest declines significantly. To this end, we develop a novel approach that enhances the performance of traditional random forest by reducing the contribution of trees whose nodes are populated with less informative features. The proposed method selects eligible subsets at each node by weighted random sampling as opposed to simple random sampling in traditional random forest. We refer to this modified random forest algorithm as "Enriched Random Forest". Using several high-dimensional micro-array datasets, we evaluate the performance of our approach in both regression and classification settings. In addition, we demonstrate the effectiveness of balanced leave-one-out cross-validation to reduce computational load and decrease sample size while computing feature weights. Overall, the results indicate that enriched random forest improves the prediction accuracy of traditional random forest, especially when relevant features are very few.
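The node-level step described in this abstract, weighted rather than uniform sampling of candidate features, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_node_features`, the toy weights, and the subset size `mtry` are assumptions made here for illustration. The abstract does not say how the feature weights are computed, so the sketch simply takes them as given.

```python
import random

def sample_node_features(weights, mtry, rng):
    """Return mtry distinct feature indices, drawn one at a time with
    probability proportional to the weights of the features still available.
    (Hypothetical helper; the weights are assumed to be precomputed.)"""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(mtry):
        pick = rng.choices(
            remaining, weights=[weights[i] for i in remaining], k=1
        )[0]
        chosen.append(pick)
        remaining.remove(pick)  # sampling without replacement
    return chosen

# Toy setting: 2 heavily weighted features among 100 (weights are made up).
rng = random.Random(0)
weights = [5.0, 5.0] + [0.01] * 98
counts = [0] * len(weights)
for _ in range(1000):
    for idx in sample_node_features(weights, 3, rng):
        counts[idx] += 1
```

With uniform sampling each feature would appear in roughly 3% of the candidate subsets; under these toy weights the two heavily weighted features appear in nearly all of them, which is the effect the weighting is meant to achieve.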
4. Pan G, Sun C, Liao Z, Tang J. Machine and Deep Learning for Prediction of Subcellular Localization. Methods Mol Biol 2022; 2361:249-261. [PMID: 34236666 DOI: 10.1007/978-1-0716-1641-3_15] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
Abstract
Protein subcellular localization prediction (PSLP), which plays an important role in the field of computational biology, identifies the position and function of proteins in cells without expensive and laborious experiments. In the past few decades, various methods with different algorithms have been proposed to solve the problem of subcellular localization prediction; machine learning and deep learning methods constitute a large portion of them. To provide an overview of those methods, the first part of this article briefly reviews several state-of-the-art machine learning methods for subcellular localization prediction; then the materials used in subcellular localization prediction are described, and a simple prediction method that takes protein sequences as input and uses a convolutional neural network as the classifier is introduced. Finally, a list of notes indicates the major problems that may occur with this method.
Affiliation(s)
- Gaofeng Pan
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- Chao Sun
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- Zijun Liao
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA; Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fujian, China
- Jijun Tang
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA; School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
5. Liao Z, Pan G, Sun C, Tang J. Predicting subcellular location of protein with evolution information and sequence-based deep learning. BMC Bioinformatics 2021; 22:515. [PMID: 34686152 PMCID: PMC8539821 DOI: 10.1186/s12859-021-04404-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/20/2021] [Accepted: 09/24/2021] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND: Protein subcellular localization prediction plays an important role in biology research. Since traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of the proposed methods ignore the evolutionary information of proteins. To improve the prediction accuracy, we present a deep learning-based method to predict protein subcellular locations.
RESULTS: Our method uses not only the amino acid composition of protein sequences but also the evolution matrices of proteins. It combines a bidirectional long short-term memory network that processes the entire protein sequence with a convolutional neural network that extracts features from protein sequences. The position specific scoring matrix is used as a supplement to protein sequences. Our method was trained and tested on two benchmark datasets. The experimental results show that our method yields accurate results on the two datasets, with an average precision of 0.7901, a ranking loss of 0.0758, and a coverage of 1.2848.
CONCLUSION: The experimental results show that our method outperforms five currently available methods. Based on these experiments, our method is an acceptable alternative for predicting protein subcellular location.
Affiliation(s)
- Zhijun Liao
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, 1 Xuefu North Road, University Town, Fuzhou, 350122 FJ China
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- Gaofeng Pan
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- Chao Sun
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- Jijun Tang
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
  - College of Electrical and Power Engineering, Taiyuan University of Technology, No. 79 Yinze West Street, Taiyuan, 030024 SX China
6.
Abstract
The elucidation of the subcellular localization of proteins is very important for a deep understanding of their functions. In fact, protein activities are strictly correlated to the cellular compartment and microenvironment in which they are present. In recent years, several effective and reliable proteomics techniques and computational methods have been developed and implemented to identify protein subcellular localization. This process is often time-consuming and expensive, but recent technological and bioinformatics progress has allowed the development of more accurate and simpler workflows to determine the localization, interactions, and functions of proteins. In the following chapter, a brief introduction to the importance of knowing the subcellular localization of proteins will be presented. Then, sample preparation protocols, proteomic methods, data analysis strategies, and software for the prediction of protein localization will be presented and discussed. Finally, the more recent and advanced spatial proteomics techniques will be shown.
Affiliation(s)
- Elettra Barberis
  - Department of Translational Medicine, University of Piemonte Orientale, Novara, Italy
  - Center for Translational Research on Autoimmune and Allergic Diseases, CAAD, University of Piemonte Orientale, Novara, Italy
- Emilio Marengo
  - Department of Sciences and Technological Innovation, University of Piemonte Orientale, Alessandria, Italy
  - Center for Translational Research on Autoimmune and Allergic Diseases, CAAD, University of Piemonte Orientale, Novara, Italy
- Marcello Manfredi
  - Department of Translational Medicine, University of Piemonte Orientale, Novara, Italy
  - Center for Translational Research on Autoimmune and Allergic Diseases, CAAD, University of Piemonte Orientale, Novara, Italy
7. Ding Y, Tang J, Guo F. Human protein subcellular localization identification via fuzzy model on Kernelized Neighborhood Representation. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106596] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 12/29/2022]
8. Garapati HS, Male G, Mishra K. Predicting subcellular localization of proteins using protein-protein interaction data. Genomics 2020; 112:2361-2368. [PMID: 31945465 DOI: 10.1016/j.ygeno.2020.01.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/20/2019] [Revised: 01/01/2020] [Accepted: 01/11/2020] [Indexed: 10/25/2022]
Abstract
The knowledge of subcellular localization of proteins can provide useful clues about their functions. The conventional methods to determine the subcellular localization are unable to keep pace with the rate at which the new data is being generated. Thus, though sequence information is available, the localization and function of a number of proteins remains unknown. In this study, we have developed a script that makes use of the physical interactors of a protein and their localization data to predict the subcellular localization. We used the script to predict the localization of yeast proteins for which there is no localization data. Further, we experimentally verified the predicted localization for six arbitrarily chosen proteins and found our predictions to be correct for five of the proteins.
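The idea in this abstract, inferring a protein's location from the known locations of its physical interactors, can be sketched in a few lines of Python. The abstract does not state the decision rule used by the authors' script, so the majority vote below is an assumption, and the function name, protein identifiers, and toy data are all hypothetical.

```python
from collections import Counter

def predict_localization(protein, interactions, known_locations):
    """Predict a location as the most frequent known location among the
    protein's interactors (assumed majority-vote rule, not the authors' script).
    Returns None when no interactor has localization data."""
    votes = Counter()
    for partner in interactions.get(protein, ()):
        for location in known_locations.get(partner, ()):
            votes[location] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Toy data: hypothetical identifiers, not taken from the paper.
interactions = {"P1": ["A", "B", "C"]}
known_locations = {
    "A": ["nucleus"],
    "B": ["nucleus", "cytoplasm"],
    "C": ["mitochondrion"],
}
prediction = predict_localization("P1", interactions, known_locations)
```

Here `prediction` is `"nucleus"`, since two of the three interactors are annotated to the nucleus; a protein with no annotated interactors yields `None` rather than a guess.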
Affiliation(s)
- Hita Sony Garapati
  - Department of Biochemistry, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
- Gurranna Male
  - Department of Biochemistry, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
- Krishnaveni Mishra
  - Department of Biochemistry, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
9. Shen Y, Ding Y, Tang J, Zou Q, Guo F. Critical evaluation of web-based prediction tools for human protein subcellular localization. Brief Bioinform 2019; 21:1628-1640. [DOI: 10.1093/bib/bbz106] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 05/07/2019] [Revised: 07/23/2019] [Accepted: 07/27/2019] [Indexed: 11/12/2022] Open
Abstract
Human protein subcellular localization has important research value in the study of biological processes, as well as in elucidating protein functions and identifying drug targets. Over the past decade, a number of protein subcellular localization prediction tools have been designed and made freely available online. The purpose of this paper is to summarize the progress of research on the subcellular localization of human proteins in recent years, including commonly used data sets proposed in earlier work and the performance of all selected prediction tools against the same benchmark data set. We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in performance, as measured by accuracy, relative to the other methods. We also build a new data set using the latest version of the UniProt database and construct a new GO-based prediction method, HumLoc-LBCI. We then test all selected prediction tools on the new data set. Finally, we discuss possible development directions for human protein subcellular localization. Availability: The codes and data are available from http://www.lbci.cn/syn/.
Affiliation(s)
- Yinan Shen
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yijie Ding
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
  - School of Computational Science and Engineering, University of South Carolina, Columbia, USA
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Fei Guo
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
10. Pang L, Wang J, Zhao L, Wang C, Zhan H. A Novel Protein Subcellular Localization Method With CNN-XGBoost Model for Alzheimer's Disease. Front Genet 2019; 9:751. [PMID: 30713552 PMCID: PMC6345701 DOI: 10.3389/fgene.2018.00751] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 09/15/2018] [Accepted: 12/31/2018] [Indexed: 12/26/2022] Open
Abstract
The disordered distribution of proteins in a compartment or organelle leads to many human diseases, including neurodegenerative diseases such as Alzheimer's disease. The prediction of protein subcellular localization plays an important role in understanding the mechanisms of protein function, pathogenesis, and disease therapy. This paper proposes a novel subcellular localization method that integrates a Convolutional Neural Network (CNN) and eXtreme Gradient Boosting (XGBoost): the CNN acts as a feature extractor that automatically obtains features from the original sequence information, and an XGBoost classifier acts as a recognizer that identifies the protein subcellular localization from the output of the CNN. Experiments are carried out on three protein datasets. The results show that the CNN-XGBoost method performs better than general protein subcellular localization methods.
Affiliation(s)
- Long Pang
  - Harbin Nebula Bioinformatics Technology Development Co., Ltd., Harbin, China
- Junjie Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Lingling Zhao
  - School of Electronic Engineering, Heilongjiang University, Harbin, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Hui Zhan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
11. Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: Apoptosis is associated with several human diseases, including cancer, autoimmune disease, neurodegenerative disease, and ischemic damage. Subcellular localization information for apoptosis proteins is very important for understanding the mechanism of programmed cell death and for drug development. However, predicting the subcellular localization of apoptosis proteins is still a challenging task.
RESULTS: In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. First, features are extracted from the protein sequences by combining the pseudo-position specific scoring matrix (PsePSSM) and the detrended cross-correlation analysis coefficient (DCCA coefficient); the dimensionality of the extracted feature information is then reduced by local Fisher discriminant analysis (LFDA). Finally, the optimal feature vectors are input to an SVM classifier to predict the subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7%, 99.6%, and 100% are achieved on the three benchmark datasets, respectively, by the most rigorous jackknife test, which is better than other state-of-the-art methods.
CONCLUSION: The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, which is high enough for it to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/.
Affiliation(s)
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
- Shan Li
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Wenying Qiu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Minghui Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Junwei Du
  - College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
12. Wang L, Zhao Y, Chen Y, Wang D. The effect of three novel feature extraction methods on the prediction of the subcellular localization of multi-site virus proteins. Bioengineered 2018; 9:196-202. [PMID: 28886267 PMCID: PMC5972939 DOI: 10.1080/21655979.2017.1373536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2017] [Accepted: 07/05/2017] [Indexed: 11/08/2022] Open
Abstract
Experimental methods play a crucial role in identifying the subcellular localization of proteins and building high-quality databases. However, more efficient, automated computational methods are required to predict the subcellular localization of proteins on a large scale. Various efficient feature extraction methods have been proposed to predict subcellular localization, but challenges remain. In this paper, three novel feature extraction methods are established to improve multi-site prediction. The first novel feature extraction method utilizes repetitive information via moving windows based on a dipeptide pseudo amino acid composition method (R-Dipeptide). The second novel feature extraction method utilizes the impact of each amino acid residue on its following residues based on pseudo amino acids (I-PseAAC). The third novel feature extraction method provides local information about protein sequences that reflects the strength of the physicochemical properties of residues (PseAAC2). The multi-label k-nearest neighbor algorithm (MLKNN) is used to predict the subcellular localization of multi-site virus proteins. The best overall accuracy values of R-Dipeptide, I-PseAAC, and PseAAC2 when applied to dataset S from Virus-mPloc are 59.92%, 59.13%, and 57.94% respectively.
Affiliation(s)
- Lei Wang
  - School of Information Science and Engineering, University of Jinan, Jinan, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
- Yaou Zhao
  - School of Information Science and Engineering, University of Jinan, Jinan, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
- Yuehui Chen
  - School of Information Science and Engineering, University of Jinan, Jinan, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
- Dong Wang
  - School of Information Science and Engineering, University of Jinan, Jinan, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
13. Zhou H, Yang Y, Shen HB. Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features. Bioinformatics 2017; 33:843-853. [PMID: 27993784 DOI: 10.1093/bioinformatics/btw723] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 07/28/2016] [Accepted: 11/17/2016] [Indexed: 11/13/2022] Open
Abstract
Motivation: Protein subcellular localization prediction has been an important research topic in computational biology over the last decade. Various automatic methods have been proposed to predict locations for large scale protein datasets, where statistical machine learning algorithms are widely used for model construction. A key step in these predictors is encoding the amino acid sequences into feature vectors. Many studies have shown that features extracted from biological domains, such as gene ontology and functional domains, can be very useful for improving the prediction accuracy. However, domain knowledge usually results in redundant features and high-dimensional feature spaces, which may degrade the performance of machine learning models.
Results: In this paper, we propose a new amino acid sequence-based human protein subcellular location prediction approach, Hum-mPLoc 3.0, which covers 12 human subcellular localizations. The sequences are represented by multi-view complementary features, i.e. context vocabulary annotation-based gene ontology (GO) terms, peptide-based functional domains, and residue-based statistical features. To systematically reflect the structural hierarchy of the domain knowledge bases, we propose a novel feature representation protocol denoted as HCM (Hidden Correlation Modeling), which creates more compact and discriminative feature vectors by modeling the hidden correlations between annotation terms. Experimental results on four benchmark datasets show that HCM improves prediction accuracy by 5-11% and F1 by 8-19% compared with conventional GO-based methods. A large-scale application of Hum-mPLoc 3.0 on the whole human proteome reveals protein co-localization preferences in the cell.
Availability and Implementation: www.csbio.sjtu.edu.cn/bioinf/Hum-mPLoc3/.
Contacts: hbshen@sjtu.edu.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hang Zhou
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
14. Wan S, Mak MW, Kung SY. Transductive Learning for Multi-Label Protein Subchloroplast Localization Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:212-224. [PMID: 26887009 DOI: 10.1109/tcbb.2016.2527657] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 06/05/2023]
Abstract
Predicting the localization of chloroplast proteins at the sub-subcellular level is an essential yet challenging step to elucidate their functions. Most of the existing subchloroplast localization predictors are limited to predicting single-location proteins and ignore the multi-location chloroplast proteins. While recent studies have led to some multi-location chloroplast predictors, they usually perform poorly. This paper proposes an ensemble transductive learning method to tackle this multi-label classification problem. Specifically, given a protein in a dataset, its composition-based sequence information and profile-based evolutionary information are extracted. Each of these two kinds of features is compared with those of the other proteins in the dataset. The comparisons lead to two similarity vectors, which are combined by weighting to constitute an ensemble feature vector. A transductive learning model based on the least squares and nearest neighbor algorithms is proposed to process the ensemble features. We refer to the resulting predictor as EnTrans-Chlo. Experimental results on a stringent benchmark dataset and a novel dataset demonstrate that EnTrans-Chlo significantly outperforms state-of-the-art predictors and, in particular, gains more than 4% (absolute) improvement on the overall actual accuracy. For readers' convenience, EnTrans-Chlo is freely available online at http://bioinfo.eie.polyu.edu.hk/EnTransChloServer/.
15. Wan S, Mak MW, Kung SY. Ensemble Linear Neighborhood Propagation for Predicting Subchloroplast Localization of Multi-Location Proteins. J Proteome Res 2016; 15:4755-4762. [DOI: 10.1021/acs.jproteome.6b00686] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 11/28/2022]
Affiliation(s)
- Shibiao Wan: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Man-Wai Mak: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Sun-Yuan Kung: Department of Electrical Engineering, Princeton University, New Jersey 08540, United States
|
16
|
Predicting protein subcellular localization based on information content of gene ontology terms. Comput Biol Chem 2016; 65:1-7. [PMID: 27665466 DOI: 10.1016/j.compbiolchem.2016.09.009] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/11/2016] [Revised: 07/10/2016] [Accepted: 09/11/2016] [Indexed: 01/11/2023]
Abstract
Predicting the location where a protein resides within a cell is important in cell biology. Computational approaches to this issue have attracted increasing attention from the biomedical community. Among the protein features used to predict the subcellular localization of proteins, the feature derived from Gene Ontology (GO) has been shown to be superior to others. However, most work in this field focuses on the presence or absence of some predefined GO terms. We proposed a method to derive information from the intrinsic structure of the GO graph. The feature vector was constructed so that each element represents the information content of a GO term annotating the protein under investigation, and support vector machines were used as the classifier to test our extracted features. Evaluation experiments were conducted on three protein datasets, and the results show that our method improves eukaryotic and human subcellular location prediction accuracy by up to 1.1% over previous studies that also used GO-based features. In particular, in the scenario where the cellular component annotation is absent, our method achieves satisfactory results, with an overall accuracy of more than 87%.
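To illustrate the idea of an information-content feature vector, here is a minimal sketch. It assumes the common corpus-based definition IC(t) = -ln p(t), where p(t) is the fraction of proteins annotated with term t. The abstract instead derives information from the structure of the GO graph, which this sketch does not reproduce. The protein names are hypothetical and the GO identifiers serve only as labels.

```python
import math
from collections import Counter

def information_content(annotations):
    """Return IC(t) = -ln p(t) for each GO term, where p(t) is the
    fraction of proteins annotated with t (assumed corpus-based definition)."""
    n = len(annotations)
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    return {t: -math.log(c / n) for t, c in counts.items()}

def feature_vector(terms, ic, vocabulary):
    """One element per GO term: its IC if the protein carries the term, else 0."""
    present = set(terms)
    return [ic[t] if t in present else 0.0 for t in vocabulary]

# Toy annotation table (hypothetical proteins).
annotations = {
    "P1": ["GO:0005634", "GO:0003677"],
    "P2": ["GO:0005634"],
    "P3": ["GO:0005739"],
    "P4": ["GO:0005634", "GO:0005739"],
}

ic = information_content(annotations)
vocabulary = sorted(ic)  # fixed term order shared by all feature vectors
vec = feature_vector(annotations["P1"], ic, vocabulary)
```

Rare terms receive larger values than frequent ones, so the vector carries more than the presence or absence of a term. Vectors built this way could then be passed to any classifier, such as the support vector machines mentioned in the abstract.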
|
17
|
Wan S, Mak MW, Kung SY. Mem-mEN: Predicting Multi-Functional Types of Membrane Proteins by Interpretable Elastic Nets. IEEE/ACM Trans Comput Biol Bioinform 2016; 13:706-718. [PMID: 26336143 DOI: 10.1109/tcbb.2015.2474407] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 06/05/2023]
Abstract
Membrane proteins play important roles in various biological processes within organisms. Predicting the functional types of membrane proteins is indispensable to the characterization of membrane proteins. Recent studies have extended to predicting single- and multi-type membrane proteins. However, existing predictors perform poorly and, more importantly, they often lack interpretability. To address these problems, this paper proposes an efficient predictor, namely Mem-mEN, which can produce sparse and interpretable solutions for predicting membrane proteins with single- and multi-label functional types. Given a query membrane protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number, which is subsequently classified by a multi-label elastic net (EN) classifier. Experimental results show that Mem-mEN significantly outperforms existing state-of-the-art membrane-protein predictors. Moreover, by using Mem-mEN, 338 out of more than 7,900 GO terms are found to play more essential roles in determining the functional types. Based on these 338 essential GO terms, Mem-mEN can not only predict the functional type of a membrane protein, but also explain why it belongs to that type. For the reader's convenience, the Mem-mEN server is available online at http://bioinfo.eie.polyu.edu.hk/MemmENServer/.
|
18
|
Wan S, Mak MW, Kung SY. Mem-ADSVM: A two-layer multi-label predictor for identifying multi-functional types of membrane proteins. J Theor Biol 2016; 398:32-42. [DOI: 10.1016/j.jtbi.2016.03.013] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 11/20/2015] [Revised: 03/07/2016] [Accepted: 03/07/2016] [Indexed: 02/06/2023]
|
19
|
Wan S, Mak MW, Kung SY. Benchmark data for identifying multi-functional types of membrane proteins. Data Brief 2016; 8:105-7. [PMID: 27294176 PMCID: PMC4889873 DOI: 10.1016/j.dib.2016.05.024] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 03/15/2016] [Revised: 05/05/2016] [Accepted: 05/14/2016] [Indexed: 11/18/2022] Open
Abstract
Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. In this article, we provide data that are used for training and testing Mem-ADSVM (Wan et al., 2016. “Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins” [1]), a two-layer multi-label predictor for predicting multi-functional types of membrane proteins.
Affiliation(s)
- Shibiao Wan: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Man-Wai Mak: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Sun-Yuan Kung: Department of Electrical Engineering, Princeton University, New Jersey, USA
|
20
|
Wan S, Mak MW, Kung SY. Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins. BMC Bioinformatics 2016; 17:97. [PMID: 26911432 PMCID: PMC4765148 DOI: 10.1186/s12859-016-0940-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/17/2015] [Accepted: 01/27/2016] [Indexed: 11/10/2022] Open
Abstract
Background: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high-performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. Results: This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, it is possible not only to decide where a protein is located, but also to reveal why it resides there. Conclusions: Experimental results show that the outputs of both mEN and mLASSO are interpretable and that they perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers’ convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.
Affiliation(s)
- Shibiao Wan: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China.
- Man-Wai Mak: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China.
- Sun-Yuan Kung: Department of Electrical Engineering, Princeton University, New Jersey, USA.
|
21
|
Sharma R, Dehzangi A, Lyons J, Paliwal K, Tsunoda T, Sharma A. Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC. IEEE Trans Nanobioscience 2015; 14:915-26. [DOI: 10.1109/tnb.2015.2500186] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Indexed: 11/08/2022]
|
22
|
Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme. Int J Mach Learn Cybern 2015. [DOI: 10.1007/s13042-015-0460-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 01/11/2023]
|