1
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024]
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its function, the functioning of the cell, and its possible interactions with other proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories, followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. The review is supported by taxonomies of prediction tools that highlight their relevant area, with examples, for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools to assist readers and can be considered a quick guide for researchers to identify and explore recent advances in the literature.
Affiliation(s)
- Maryam Gillani
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
2
Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024]
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell will give insights into tailored drug design for a disease. As the gold validation standard, the conventional wet lab uses fluorescent microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags for protein subcellular location identification. However, the booming era of proteomics and high-throughput sequencing generates tons of newly discovered proteins, making protein subcellular localization by wet-lab experiments a mission impossible. To tackle this concern, in the past decades, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area. In this article, we review the latest advances in AI-based method development in three typical types of approaches, including sequence-based, knowledge-based, and image-based methods. We also elaborately discuss existing challenges and future directions in AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou
  - College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang
  - Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
3
Wang C, Wang Y, Ding P, Li S, Yu X, Yu B. ML-FGAT: Identification of multi-label protein subcellular localization by interpretable graph attention networks and feature-generative adversarial networks. Comput Biol Med 2024; 170:107944. [PMID: 38215617 DOI: 10.1016/j.compbiomed.2024.107944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 12/08/2023] [Accepted: 01/01/2024] [Indexed: 01/14/2024]
Abstract
The prediction of multi-label protein subcellular localization (SCL) is a pivotal area in bioinformatics research. Recent advancements in protein structure research have facilitated the application of graph neural networks. This paper introduces a novel approach termed ML-FGAT. The approach begins by extracting node information of proteins from sequence data, physical-chemical properties, evolutionary insights, and structural details. Subsequently, various evolutionary techniques are integrated to consolidate multi-view information. A linear discriminant analysis framework, grounded on entropy weight, is then employed to reduce the dimensionality of the merged features. To enhance the robustness of the model, the training dataset is augmented using feature-generative adversarial networks. For the primary prediction step, graph attention networks are employed to determine multi-label protein SCL, leveraging both node and neighboring information. The interpretability is enhanced by analyzing the attention weight parameters. The training is based on the Gram-positive bacteria dataset, while validation employs newly constructed datasets: human, virus, Gram-negative bacteria, plant, and SARS-CoV-2. Following a leave-one-out cross-validation procedure, ML-FGAT demonstrates noteworthy superiority in this domain.
Affiliation(s)
- Congjing Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yifei Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Pengju Ding
  - College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Shan Li
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Xu Yu
  - Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
- Bin Yu
  - School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China
4
Zhou H, Tan W, Shi S. DeepGpgs: a novel deep learning framework for predicting arginine methylation sites combined with Gaussian prior and gated self-attention mechanism. Brief Bioinform 2023; 24:7000314. [PMID: 36694944 DOI: 10.1093/bib/bbad018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/09/2022] [Revised: 12/26/2022] [Accepted: 01/04/2023] [Indexed: 01/26/2023]
Abstract
Protein arginine methylation is an important posttranslational modification (PTM) associated with protein functional diversity and pathological conditions including cancer. Identification of methylation binding sites facilitates a better understanding of the molecular function of proteins. Recent developments in the field of deep neural networks have led to a proliferation of deep learning-based methylation identification studies because of their fast and accurate prediction. In this paper, we propose DeepGpgs, an advanced deep learning model incorporating Gaussian prior and gated attention mechanism. We introduce a residual network channel to extract the evolutionary information of proteins. Then we combine the adaptive embedding with bidirectional long short-term memory networks to form a context-shared encoder layer. A gated multi-head attention mechanism is followed to obtain the global information about the sequence. A Gaussian prior is injected into the sequence to assist in predicting PTMs. We also propose a weighted joint loss function to alleviate the false negative problem. We empirically show that DeepGpgs improves Matthews correlation coefficient by 6.3% on the arginine methylation independent test set compared with the existing state-of-the-art methylation site prediction methods. Furthermore, DeepGpgs has good robustness in phosphorylation site prediction of SARS-CoV-2, which indicates that DeepGpgs has good transferability and the potential to be extended to other modification sites prediction. The open-source code and data of the DeepGpgs can be obtained from https://github.com/saizhou1/DeepGpgs.
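For readers unfamiliar with the Matthews correlation coefficient (MCC) cited in this abstract, the following is a minimal sketch of the standard binary-classification definition computed from confusion-matrix counts. The function name `mcc_from_counts` and the example counts are illustrative assumptions; they are not taken from the DeepGpgs code or its results.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Hypothetical helper for illustration; returns 0.0 when the
    denominator vanishes (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up counts: a perfect, an inverted, and an uninformative predictor.
perfect = mcc_from_counts(tp=50, tn=50, fp=0, fn=0)
inverted = mcc_from_counts(tp=0, tn=0, fp=50, fn=50)
chance = mcc_from_counts(tp=25, tn=25, fp=25, fn=25)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often preferred over plain accuracy when modified sites are much rarer than unmodified ones.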
Affiliation(s)
- Haiwei Zhou
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Wenxi Tan
  - School of Mathematical Sciences, Fudan University, Shanghai 200433, China
- Shaoping Shi
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
5
Ba W, Jin X, Lu J, Rao Y, Zhang T, Zhang X, Zhou J, Li S. Research on predicting early Fusarium head blight with asymptomatic wheat grains by micro-near infrared spectrometer. Spectrochim Acta A Mol Biomol Spectrosc 2023; 287:122047. [PMID: 36327806 DOI: 10.1016/j.saa.2022.122047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2022] [Revised: 10/17/2022] [Accepted: 10/23/2022] [Indexed: 06/16/2023]
Abstract
Fusarium head blight (FHB) is considered one of the most serious fungal diseases of wheat. Fusarium infection results in yield losses and contamination of harvested grains with mycotoxins. Therefore, diagnosing Fusarium head blight in early asymptomatic wheat is vital. To detect early FHB, a micro-near-infrared spectrometer was used to collect the spectrum of wheat grains, and FHB infection of wheat was detected by combining chemometrics in the 900-1700 nm near-infrared spectral region. First, the obtained spectra were analysed, and the pre-processed data were compared. The modelling analysis was then performed using the support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), Autokeras, and Autogluon (with SVM) algorithms. The results showed that SG smoothing with standard normal variate (SG + SNV) was the best pre-treatment method. In addition, when SG + SNV was combined with the Autogluon (with SVM) model, the optimal classification results were obtained, with an accuracy of 73.33 % and an F1 value of 72.86 %. Autogluon (with SVM) could prevent overfitting and optimize generalization. The manuscript then discusses the performance of the Autogluon (with SVM) model with different numbers of stacking layers. The results show that one stacking layer can yield a classification model with excellent performance. These results indicate that near-infrared (NIR) spectroscopy has the potential for early detection of Fusarium head blight at an early asymptomatic stage.
Affiliation(s)
- Wenjing Ba
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
- Xiu Jin
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
- Jie Lu
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Agriculture, Anhui Agricultural University, Hefei 230001, China
- Yuan Rao
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
- Tong Zhang
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
- XiaoDan Zhang
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
- Jun Zhou
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
- Shaowen Li
  - Anhui Province Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Anhui Agriculture University, Hefei 230001, China; College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China
6
Wei Q, Zhang Q, Gao H, Song T, Salhi A, Yu B. DEEPStack-RBP: Accurate identification of RNA-binding proteins based on autoencoder feature selection and deep stacking ensemble classifier. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2022]
7
A novel deep learning-assisted hybrid network for plasmodium falciparum parasite mitochondrial proteins classification. PLoS One 2022; 17:e0275195. [PMID: 36201724 PMCID: PMC9536844 DOI: 10.1371/journal.pone.0275195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 09/12/2022] [Indexed: 11/18/2022]
Abstract
Plasmodium falciparum is a parasitic protozoan that can cause malaria, which is a deadly disease. Therefore, the accurate identification of malaria parasite mitochondrial proteins is essential for understanding their functions and identifying novel drug targets. For classifying protein sequences, several adaptive statistical techniques have been devised. Despite significant gains, prediction performance is still constrained by the lack of appropriate feature descriptors and learning strategies in current systems. Moreover, good ground truth data is important for Artificial Intelligence (AI)-based models, but such data is lacking in the literature. Therefore, in this work, we propose a novel hybrid network that combines a 1D Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BGRU) to classify the malaria parasite mitochondrial proteins. Furthermore, we curate sequence data collected from the National Center for Biotechnology Information (NCBI) and UniProtKB/Swiss-Prot protein databanks to prepare a dataset that the research community can use to evaluate AI-based algorithms. We obtain 4204 cases after preprocessing the collected data and denote this set of proteins as PF4204. Finally, we conduct an ablation study on several conventional and deep models using PF4204 and the benchmark PF2095 datasets. The proposed model 'CNN-BGRU' obtains accuracy values of 0.9096 and 0.9857 on the PF4204 and PF2095 datasets, respectively. In addition, CNN-BGRU is compared with state-of-the-art methods, and the results illustrate that it can extract robust features and identify proteins accurately.
8
Liu Y, Jin S, Gao H, Wang X, Wang C, Zhou W, Yu B. Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier. Bioinformatics 2021; 38:1223-1230. [PMID: 34864897 PMCID: PMC8690230 DOI: 10.1093/bioinformatics/btab811] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/19/2021] [Revised: 11/17/2021] [Accepted: 11/30/2021] [Indexed: 01/05/2023]
Abstract
MOTIVATION: Multi-label (ML) protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes the invasion of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)) or expression product at a specific location in a cell, which can provide a reference for clinical treatment of diseases such as coronavirus disease 2019 (COVID-19).
RESULTS: The article proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information: pseudo amino acid composition, encoding based on grouped weight, gene ontology, multi-scale continuous and discontinuous, residue probing transformation and evolutionary distance transformation. Next, the ML information latent semantic index method is used to avoid the interference of redundant information. Finally, ML learning with feature-induced labeling information enrichment is adopted to predict the ML protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset serve as the test sets. The overall actual accuracies on the first four datasets are 99.23%, 93.82%, 93.24% and 96.72%, respectively, by leave-one-out cross-validation. The overall actual accuracy of the predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has obvious advantages in predicting the SCL of ML proteins, which provides new ideas for further research on the SCL of ML proteins.
AVAILABILITY AND IMPLEMENTATION: The source codes and datasets are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yushuang Liu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Shuping Jin
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Hongli Gao
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Xue Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Congjing Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Weifeng Zhou
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Bin Yu (to whom correspondence should be addressed)
  - School of Data Science, Qingdao University of Science and Technology, Qingdao 266061, China; College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
9
BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information. Comput Math Methods Med 2021; 2021:7764764. [PMID: 34484416 PMCID: PMC8413034 DOI: 10.1155/2021/7764764] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/16/2021] [Accepted: 08/13/2021] [Indexed: 01/19/2023]
Abstract
As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for better revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are overpriced and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify the m7G sites. In this study, we propose a novel method via incorporating BERT-based multilingual model in bioinformatics to represent the information of RNA sequences. Firstly, we treat RNA sequences as natural sentences and then employ bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Secondly, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important features. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with tree-structured Parzen estimator (TPE) approach. By 10-fold cross-validation, the performance of BERT-m7G is measured with an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
10
Zhang Q, Zhang Y, Li S, Han Y, Jin S, Gu H, Yu B. Accurate prediction of multi-label protein subcellular localization through multi-view feature learning with RBRL classifier. Brief Bioinform 2021; 22:6127451. [PMID: 33537726 DOI: 10.1093/bib/bbab012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/20/2020] [Revised: 12/12/2020] [Accepted: 01/06/2021] [Indexed: 01/27/2023]
Abstract
Multi-label proteins can participate in carrier transportation, enzyme catalysis, hormone regulation and other life activities. Meanwhile, they play a key role in the fields of biopharmaceuticals, gene and cell therapy. This article proposes a prediction method called Mps-mvRBRL to predict the subcellular localization (SCL) of multi-label protein. Firstly, pseudo position-specific scoring matrix, dipeptide composition, position specific scoring matrix-transition probability composition, gene ontology and pseudo amino acid composition algorithms are used to obtain numerical information from different views. Based on the contribution of five individual feature extraction methods, differential evolution is used for the first time to learn the weight of single feature, and then these original features use a weighted combination method to fuse multi-view information. Secondly, the fused high-dimensional features use a weighted linear discriminant analysis framework based on binary weight form to eliminate irrelevant information. Finally, the best feature vector is input into the joint ranking support vector machine and binary relevance with robust low-rank learning classifier to predict the SCL. After applying leave-one-out cross-validation, the overall actual accuracy (OAA) and overall location accuracy (OLA) of Mps-mvRBRL on the training set of Gram-positive bacteria are both 99.81%. The OAA on the test sets of plant, virus and Gram-negative bacteria datasets are 97.24%, 98.55% and 98.20%, respectively, and the OLA are 97.16%, 97.62% and 98.28%, respectively. The results show that the model achieves good prediction performance for predicting the SCL of multi-label protein.
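The two metrics reported in this abstract can be sketched as follows, assuming the definitions commonly used in the multi-label localization literature: overall actual accuracy (OAA) is the fraction of proteins whose predicted location set exactly matches the true set, and overall location accuracy (OLA) is the fraction of true (protein, location) pairs that are recovered. The abstract does not spell these definitions out, so treat them, the function names, and the toy labels as assumptions.

```python
from typing import List, Set


def overall_actual_accuracy(true_sets: List[Set[str]],
                            pred_sets: List[Set[str]]) -> float:
    """Fraction of proteins whose predicted label set is exactly right."""
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return exact / len(true_sets)


def overall_location_accuracy(true_sets: List[Set[str]],
                              pred_sets: List[Set[str]]) -> float:
    """Fraction of true locations that appear among the predictions."""
    hits = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    total = sum(len(t) for t in true_sets)
    return hits / total


# Toy data (made up): three proteins, one of them with two locations.
true_sets = [{"cell wall"}, {"cytoplasm", "membrane"}, {"extracellular"}]
pred_sets = [{"cell wall"}, {"cytoplasm"}, {"extracellular"}]

oaa = overall_actual_accuracy(true_sets, pred_sets)    # 2 of 3 proteins exact
ola = overall_location_accuracy(true_sets, pred_sets)  # 3 of 4 locations found
```

Under these definitions OAA is the stricter measure, because a protein with a partially correct label set counts as a miss for OAA but still contributes its correct locations to OLA.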
Affiliation(s)
- Qi Zhang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Yandan Zhang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Shan Li
  - School of Mathematics and Statistics, Central South University, China
- Yu Han
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Shuping Jin
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Haiming Gu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
11
Zhang Q, Liu P, Wang X, Zhang Y, Han Y, Yu B. StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106921] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 02/08/2023]
12
Semwal R, Varadwaj PK. HumDLoc: Human Protein Subcellular Localization Prediction Using Deep Neural Network. Curr Genomics 2020; 21:546-557. [PMID: 33214771 PMCID: PMC7604748 DOI: 10.2174/1389202921999200528160534] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/10/2020] [Revised: 03/27/2020] [Accepted: 03/30/2020] [Indexed: 11/24/2022]
Abstract
Aims: To develop a tool that can annotate subcellular localization of human proteins.
Background: With the progression of high throughput human proteomics projects, an enormous amount of protein sequence data has been discovered in the recent past. All these raw sequence data require precise mapping and annotation for their respective biological role and functional attributes. The functional characteristics of protein molecules are highly dependent on the subcellular localization/compartment. Therefore, a fully automated and reliable protein subcellular localization prediction system would be very useful for current proteomic research.
Objective: To develop a machine learning-based predictive model that can annotate the subcellular localization of human proteins with high accuracy and precision.
Methods: In this study, we used the PSI-CD-HIT homology criterion and utilized the sequence-based features of protein sequences to develop a powerful subcellular localization predictive model. The dataset used to train the HumDLoc model was extracted from a reliable data source, Uniprot knowledge base, which helps the model to generalize on the unseen dataset.
Results: The proposed model, HumDLoc, was compared with two of the most widely used techniques: CELLO and DeepLoc, and other machine learning-based tools. The result demonstrated promising predictive performance of HumDLoc model based on various machine learning parameters such as accuracy (≥97.00%), precision (≥0.86), recall (≥0.89), MCC score (≥0.86), ROC curve (0.98 square unit), and precision-recall curve (0.93 square unit).
Conclusion: In conclusion, HumDLoc was able to outperform several alternative tools for correctly predicting subcellular localization of human proteins. The HumDLoc has been hosted as a web-based tool at https://bioserver.iiita.ac.in/HumDLoc/.
Affiliation(s)
- Rahul Semwal
  - Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India; Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
- Pritish Kumar Varadwaj
  - Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India; Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
13
M A Basher AR, McLaughlin RJ, Hallam SJ. Metabolic pathway inference using multi-label classification with rich pathway features. PLoS Comput Biol 2020; 16:e1008174. [PMID: 33001968 PMCID: PMC7529316 DOI: 10.1371/journal.pcbi.1008174] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/02/2020] [Accepted: 07/21/2020] [Indexed: 12/15/2022]
Abstract
Metabolic inference from genomic sequence information is a necessary step in determining the capacity of cells to make a living in the world at different levels of biological organization. A common method for determining the metabolic potential encoded in genomes is to map conceptually translated open reading frames onto a database containing known product descriptions. Such gene-centric methods are limited in their capacity to predict pathway presence or absence and do not support standardized rule sets for automated and reproducible research. Pathway-centric methods based on defined rule sets or machine learning algorithms provide an adjunct or alternative inference method that supports hypothesis generation and testing of metabolic relationships within and between cells. Here, we present mlLGPR, multi-label based on logistic regression for pathway prediction, a software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks in organismal and multi-organismal datasets. We evaluated mlLGPR performance using a corpus of 12 experimental datasets manifesting diverse multi-label properties, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. Resulting performance metrics equaled or exceeded previous reports for organismal genomes and identified specific challenges associated with feature engineering and training data for community-level metabolic inference.
Affiliation(s)
- Abdur Rahman M A Basher
  - Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada
- Ryan J McLaughlin
  - Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada
- Steven J Hallam
  - Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada
  - Department of Microbiology & Immunology, University of British Columbia, 2552-2350 Health Sciences Mall, Vancouver, British Columbia, Canada
  - Genome Science and Technology Program, University of British Columbia, 2329 West Mall, Vancouver, BC, Canada
  - Life Sciences Institute, University of British Columbia, Vancouver, British Columbia, Canada
  - ECOSCOPE Training Program, University of British Columbia, Vancouver, British Columbia, Canada
14
|
Bian H, Guo M, Wang J. Recognition of Mitochondrial Proteins in Plasmodium Based on the Tripeptide Composition. Front Cell Dev Biol 2020; 8:578901. [PMID: 33043014 PMCID: PMC7525148 DOI: 10.3389/fcell.2020.578901] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2020] [Accepted: 08/13/2020] [Indexed: 01/31/2023] Open
Abstract
Mitochondria play essential roles in eukaryotic cells, especially in Plasmodium cells. They have several unusual evolutionary and functional features that are vital for disease diagnosis and drug design. Thus, predicting mitochondrial proteins of Plasmodium has become a worthwhile task. However, existing computational methods can only predict mitochondrial proteins of Plasmodium falciparum (P. falciparum for short), and these methods have low accuracy. It is highly desirable to design a classifier with high accuracy for predicting mitochondrial proteins for all Plasmodium species, not only P. falciparum. We proposed a novel method, named PM-OTC, for predicting mitochondrial proteins in Plasmodium. PM-OTC uses the Support Vector Machine (SVM) as the classifier and the selected tripeptide composition as the features. We adopted the 5-fold cross-validation method to train and test PM-OTC. Results demonstrate that PM-OTC achieves an accuracy of 94.91% and that its performance is superior to that of other methods.
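The tripeptide composition that this abstract uses as its feature source can be illustrated with a short Python sketch. It assumes the standard 20-letter amino acid alphabet and computes the full 8000-dimensional composition only; the function name `tripeptide_composition` is hypothetical, and the feature-selection step and SVM training described for PM-OTC are not reproduced here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)


def tripeptide_composition(sequence):
    """Map each of the 8000 possible tripeptides to its relative frequency."""
    seq = sequence.upper()
    allowed = set(AMINO_ACIDS)
    # Count overlapping windows of length 3, skipping windows that contain
    # non-standard residues such as X or B.
    counts = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= allowed
    )
    total = sum(counts.values())
    # Emit a fixed-order, fixed-length feature map so that every protein
    # yields a vector of the same dimension.
    return {
        "".join(tri): (counts["".join(tri)] / total if total else 0.0)
        for tri in product(AMINO_ACIDS, repeat=3)
    }
```

Calling `tripeptide_composition("MKTAYIAKQR")` returns a dictionary with 8000 entries whose values sum to 1; a feature-selection step would then keep only a subset of these entries before classification.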
Affiliation(s)
- Haodong Bian
  - School of Computer Science, Inner Mongolia University, Hohhot, China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
  - Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, China
  - Stage Key Laboratories of Reproductive Regulation & Breeding of Grassland Livestock, Hohhot, China
|
15
|
Bouziane H, Chouarfia A. Use of Chou's 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment. J Integr Bioinform 2020; 18:51-79. [PMID: 32598314 PMCID: PMC8035964 DOI: 10.1515/jib-2019-0091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022] Open
Abstract
To date, many proteins generated by large-scale genome sequencing projects are still uncharacterized and subject to intensive investigations by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for protein function elucidation. However, it remains a challenging task, especially for multi-site proteins, which are known to shuttle between cell compartments to perform their proper biological functions, and for proteins that do not have significant homology to proteins of known subcellular locations. Due to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context, given the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem, using different protein sequence features pertaining to subcellular localization; however, the overwhelming majority of them focus on single localization and cover very limited cellular locations. Prediction was essentially based on sorting signals, amino acid compositions, and homology. To improve prediction quality, the focus is currently on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotation, which has recently become widely adopted and essential information for learning systems. To address this problem, in the present study we treated the SCL prediction task as a multi-label learning problem and tried to label both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both GO terms of protein homologs and PSI-BLAST profiles.
Experiments using 5-fold cross-validation on the benchmark datasets showed a significant improvement in the results obtained by the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
Affiliation(s)
- Hafida Bouziane
  - Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
- Abdallah Chouarfia
  - Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
|
16
|
Du L, Meng Q, Chen Y, Wu P. Subcellular location prediction of apoptosis proteins using two novel feature extraction methods based on evolutionary information and LDA. BMC Bioinformatics 2020; 21:212. [PMID: 32448129 PMCID: PMC7245797 DOI: 10.1186/s12859-020-3539-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2019] [Accepted: 05/06/2020] [Indexed: 11/13/2022] Open
Abstract
Background Apoptosis, also called programmed cell death, refers to the spontaneous and orderly death of cells controlled by genes in order to maintain a stable internal environment. Identifying the subcellular location of apoptosis proteins is very helpful in understanding the mechanism of apoptosis and designing drugs. Therefore, the subcellular localization of apoptosis proteins has attracted increased attention in computational biology. Effective feature extraction methods play a critical role in predicting the subcellular location of proteins. Results In this paper, we proposed two novel feature extraction methods based on evolutionary information. One obtains the evolutionary information via the transition matrix of the consensus sequence (CTM); the other derives it from the PSSM based on absolute entropy correlation analysis (AECA-PSSM). After fusing the two kinds of features, linear discriminant analysis (LDA) was used to reduce the dimension of the proposed features. Finally, a support vector machine (SVM) was adopted to predict the protein subcellular locations. The proposed CTM-AECA-PSSM-LDA subcellular location prediction method was evaluated using the CL317 and ZW225 datasets. By jackknife test, the overall accuracy was 99.7% (CL317) and 95.6% (ZW225), respectively. Conclusions The experimental results show that the proposed method, which we hope will complement existing methods of subcellular localization, can effectively extract more abundant features from protein sequences and is feasible for predicting the subcellular location of apoptosis proteins.
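The jackknife test used for evaluation here is leave-one-out cross-validation: each protein is held out in turn, the model is trained on the rest, and the held-out protein is predicted. The Python sketch below shows only that evaluation loop. As an assumption made to keep it dependency-free, a 1-nearest-neighbour classifier stands in for the SVM, and the names `jackknife_accuracy` and `nearest_neighbor` are hypothetical.

```python
def nearest_neighbor(train_x, train_y, query):
    """Stand-in classifier: label of the closest training vector (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]


def jackknife_accuracy(features, labels, predict):
    """Leave-one-out accuracy: hold out each sample once and predict it from the rest."""
    if len(features) < 2 or len(features) != len(labels):
        raise ValueError("need at least two samples and one label per sample")
    correct = 0
    for i in range(len(features)):
        # Training set is every sample except the i-th one.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Any classifier with the same `(train_x, train_y, query)` call shape can be passed as `predict`, so an SVM wrapper could be substituted without changing the loop.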
Affiliation(s)
- Lei Du
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Qingfang Meng
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Yuehui Chen
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Peng Wu
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
|
17
|
Protein sequence information extraction and subcellular localization prediction with gapped k-Mer method. BMC Bioinformatics 2019; 20:719. [PMID: 31888447 PMCID: PMC6936157 DOI: 10.1186/s12859-019-3232-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Subcellular localization prediction of proteins is an important component of bioinformatics, with great importance for drug design and other applications. A multitude of computational tools for protein subcellular location have been developed in recent decades; however, existing methods differ in the protein sequence representation techniques and classification algorithms adopted. RESULTS In this paper, we first introduce two kinds of protein sequence encoding schemes: dipeptide information with space and Gapped k-mer information. Then, the Gapped k-mer calculation method, which is based on a quad-tree, is also introduced. CONCLUSIONS From the prediction results, this method not only reduces the dimension, but also improves the prediction precision of protein subcellular localization.
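One plausible reading of "dipeptide information with space" is the frequency of residue pairs separated by a fixed number of intervening positions. The Python sketch below implements that reading only; it is an assumption, not the paper's encoding, and it does not reproduce the quad-tree-based Gapped k-mer calculation. The function name `gapped_dipeptide_composition` and the `gap` parameter are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)


def gapped_dipeptide_composition(sequence, gap):
    """Relative frequency of residue pairs (a, b) with `gap` residues between them."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    seq = sequence.upper()
    step = gap + 1  # distance between the two residues of a pair
    counts = Counter()
    for i in range(len(seq) - step):
        a, b = seq[i], seq[i + step]
        # Ignore pairs that involve non-standard residues.
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] += 1
    total = sum(counts.values())
    # Fixed 400-dimensional output regardless of sequence length.
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    }
```

With `gap=0` this reduces to the ordinary dipeptide composition; concatenating the outputs for several gap values gives a longer feature vector while each block stays at 400 dimensions.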
|
18
|
Xiao X, Chen WJ, Qiu WR. A Novel Prediction of Quaternary Structural Type of Proteins with Gene Ontology. Protein Pept Lett 2019; 27:313-320. [PMID: 31749418 DOI: 10.2174/0929866526666191014144618] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2019] [Revised: 05/20/2019] [Accepted: 06/29/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Information on the quaternary structure attributes of proteins is very important because it is closely related to their biological functions. With the rapid development of new generation sequencing technology, we are facing a challenge: how to automatically identify the quaternary attributes of new polypeptide chains according to their sequence information (i.e., whether they are formed just as a monomer, as a hetero-oligomer, or as a homo-oligomer). OBJECTIVE In this article, our goal is to find a new way to represent protein sequences, thereby improving the prediction rate of protein quaternary structure. METHODS In this article, we developed a prediction system for protein quaternary structural type in which a protein sequence was expressed by combining the Pfam functional domain and gene ontology. We turn protein features into numerical sequences and complete the prediction of quaternary structure through specific machine learning and verification algorithms. RESULTS Our data set contains 5495 protein samples. Through the method provided in this paper, we classify proteins as monomers, hetero-oligomers, or homo-oligomers, and the prediction rate is 74.38%, which is 3.24% higher than that of previous studies. Through this new feature extraction method, we can further classify the quaternary structure of proteins, and the results are also correspondingly improved. CONCLUSION After applying the new prediction system, we have successfully improved the prediction rate compared with previous results. We have reason to believe that the feature extraction method in this paper has better practicability and can be used as a reference for other protein classification problems.
Affiliation(s)
- Xuan Xiao
  - School of Information, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
- Wei-Jie Chen
  - School of Information, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
- Wang-Ren Qiu
  - School of Information, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
|
19
|
Shen Y, Ding Y, Tang J, Zou Q, Guo F. Critical evaluation of web-based prediction tools for human protein subcellular localization. Brief Bioinform 2019; 21:1628-1640. [DOI: 10.1093/bib/bbz106] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2019] [Revised: 07/23/2019] [Accepted: 07/27/2019] [Indexed: 11/12/2022] Open
Abstract
Human protein subcellular localization is of considerable research value for understanding biological processes, elucidating protein functions and identifying drug targets. Over the past decade, a number of protein subcellular localization prediction tools have been designed and made freely available online. The purpose of this paper is to summarize the progress of research on the subcellular localization of human proteins in recent years, including commonly used data sets proposed in earlier work and the performance of all selected prediction tools against the same benchmark data set. We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in performance, as measured by accuracy, relative to the other methods. We also build a new data set using the latest version of the UniProt database and construct a new GO-based prediction method, HumLoc-LBCI. Then, we test all selected prediction tools on the new data set. Finally, we discuss possible development directions for human protein subcellular localization. Availability: The codes and data are available from http://www.lbci.cn/syn/.
Affiliation(s)
- Yinan Shen
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yijie Ding
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
  - School of Computational Science and Engineering, University of South Carolina, Columbia, U.S
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Fei Guo
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
|
20
|
Abstract
Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, it is highly desirable to develop computational methods for efficiently and effectively identifying protein subcellular locations. In particular, the rapidly increasing number of protein sequences entering the genome databases has called for the development of automated analysis methods.
Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, v) web servers.
Result & Conclusion: Concomitant with the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. One direction is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. Another is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.
Affiliation(s)
- Ting-He Zhang
  - School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
- Shao-Wu Zhang
  - School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
|
21
|
Prediction of Apoptosis Protein Subcellular Localization with Multilayer Sparse Coding and Oversampling Approach. BIOMED RESEARCH INTERNATIONAL 2019; 2019:2436924. [PMID: 30834257 PMCID: PMC6374881 DOI: 10.1155/2019/2436924] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/19/2018] [Revised: 01/04/2019] [Accepted: 01/20/2019] [Indexed: 11/29/2022]
Abstract
The prediction of apoptosis protein subcellular localization plays an important role in understanding the processes of cell proliferation and death. Recently, computational approaches to this issue have become very popular, since traditional biological experiments are so costly and time-consuming that they can no longer keep up with the growth rate of sequence data. In order to improve the prediction accuracy of apoptosis protein subcellular localization, we proposed a sparse coding method combined with a traditional feature extraction algorithm to complete the sparse representation of apoptosis protein sequences, using multilayer pooling based on different sizes of dictionaries to integrate the processed features, as well as an oversampling approach to decrease the influence of unbalanced data sets. The extracted features were then input to a support vector machine to predict the subcellular localization of the apoptosis protein. The experimental results obtained by jackknife test on two benchmark data sets indicate that our method can significantly improve the accuracy of apoptosis protein subcellular localization prediction.
|
22
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Apoptosis is associated with some human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage. Information on the subcellular localization of apoptosis proteins is very important for understanding the mechanism of programmed cell death and for the development of drugs. However, the prediction of subcellular localization of apoptosis proteins is still a challenging task. RESULTS In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. First, features are extracted from the protein sequences by combining the pseudo-position specific scoring matrix (PsePSSM) and the detrended cross-correlation analysis coefficient (DCCA coefficient); the dimensionality of the extracted feature information is then reduced by LFDA (local Fisher discriminant analysis). Finally, the optimal feature vectors are input to the SVM classifier to predict the subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7%, 99.6% and 100% are achieved on the three benchmark datasets, respectively, by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, making it a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
Affiliation(s)
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
  - School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
- Shan Li
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Wenying Qiu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Minghui Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Junwei Du
  - College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
|
23
|
Hasan MAM, Ahmad S, Molla MKI. Protein subcellular localization prediction using multiple kernel learning based support vector machine. MOLECULAR BIOSYSTEMS 2017; 13:785-795. [DOI: 10.1039/c6mb00860g] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
An efficient multi-label protein subcellular localization prediction system was developed by introducing multiple kernel learning (MKL) based support vector machine (SVM).
Affiliation(s)
- Md. Al Mehedi Hasan
  - Department of Computer Science & Engineering, University of Rajshahi, Rajshahi, Bangladesh
- Shamim Ahmad
  - Department of Computer Science & Engineering, University of Rajshahi, Rajshahi, Bangladesh
|
24
|
Wan S, Mak MW, Kung SY. Transductive Learning for Multi-Label Protein Subchloroplast Localization Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:212-224. [PMID: 26887009 DOI: 10.1109/tcbb.2016.2527657] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Predicting the localization of chloroplast proteins at the sub-subcellular level is an essential yet challenging step to elucidate their functions. Most of the existing subchloroplast localization predictors are limited to predicting single-location proteins and ignore the multi-location chloroplast proteins. While recent studies have led to some multi-location chloroplast predictors, they usually perform poorly. This paper proposes an ensemble transductive learning method to tackle this multi-label classification problem. Specifically, given a protein in a dataset, its composition-based sequence information and profile-based evolutionary information are extracted. Each of these two kinds of features is compared with those of the other proteins in the dataset. The comparisons lead to two similarity vectors, which are combined by weighting to constitute an ensemble feature vector. A transductive learning model based on the least squares and nearest neighbor algorithms is proposed to process the ensemble features. We refer to the resulting predictor as EnTrans-Chlo. Experimental results on a stringent benchmark dataset and a novel dataset demonstrate that EnTrans-Chlo significantly outperforms state-of-the-art predictors and, in particular, gains more than 4% (absolute) improvement in the overall actual accuracy. For readers' convenience, EnTrans-Chlo is freely available online at http://bioinfo.eie.polyu.edu.hk/EnTransChloServer/.
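The step in which two similarity vectors are combined into one ensemble feature vector can be sketched in Python as follows. The abstract does not state the weighting scheme, so the convex combination below (a single weight in [0, 1]) is an assumption, and the names `ensemble_similarity`, `seq_sim`, `profile_sim` and `weight` are hypothetical.

```python
def ensemble_similarity(seq_sim, profile_sim, weight):
    """Combine two similarity vectors element-wise.

    seq_sim     -- similarities from composition-based sequence features
    profile_sim -- similarities from profile-based evolutionary features
    weight      -- share given to seq_sim; (1 - weight) goes to profile_sim
    """
    if len(seq_sim) != len(profile_sim):
        raise ValueError("similarity vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Assumed convex combination; the published method may weight differently.
    return [weight * s + (1.0 - weight) * p for s, p in zip(seq_sim, profile_sim)]
```

Each position in the two input vectors refers to the same comparison protein, so the combined vector keeps one entry per protein in the dataset.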
|
25
|
Wan S, Mak MW, Kung SY. Ensemble Linear Neighborhood Propagation for Predicting Subchloroplast Localization of Multi-Location Proteins. J Proteome Res 2016; 15:4755-4762. [DOI: 10.1021/acs.jproteome.6b00686] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Shibiao Wan
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Man-Wai Mak
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Sun-Yuan Kung
  - Department of Electrical Engineering, Princeton University, New Jersey 08540, United States
|
26
|
Wan S, Mak MW, Kung SY. Mem-mEN: Predicting Multi-Functional Types of Membrane Proteins by Interpretable Elastic Nets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:706-718. [PMID: 26336143 DOI: 10.1109/tcbb.2015.2474407] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Membrane proteins play important roles in various biological processes within organisms. Predicting the functional types of membrane proteins is indispensable to the characterization of membrane proteins. Recent studies have extended to predicting single- and multi-type membrane proteins. However, existing predictors perform poorly and, more importantly, they often lack interpretability. To address these problems, this paper proposes an efficient predictor, namely Mem-mEN, which can produce sparse and interpretable solutions for predicting membrane proteins with single- and multi-label functional types. Given a query membrane protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number, which is subsequently classified by a multi-label elastic net (EN) classifier. Experimental results show that Mem-mEN significantly outperforms existing state-of-the-art membrane-protein predictors. Moreover, by using Mem-mEN, 338 out of more than 7,900 GO terms are found to play more essential roles in determining the functional types. Based on these 338 essential GO terms, Mem-mEN can not only predict the functional type of a membrane protein, but also explain why it belongs to that type. For the reader's convenience, the Mem-mEN server is available online at http://bioinfo.eie.polyu.edu.hk/MemmENServer/.
|
27
|
Wan S, Mak MW, Kung SY. Mem-ADSVM: A two-layer multi-label predictor for identifying multi-functional types of membrane proteins. J Theor Biol 2016; 398:32-42. [DOI: 10.1016/j.jtbi.2016.03.013] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2015] [Revised: 03/07/2016] [Accepted: 03/07/2016] [Indexed: 02/06/2023]
|
28
|
Wan S, Mak MW, Kung SY. Benchmark data for identifying multi-functional types of membrane proteins. Data Brief 2016; 8:105-7. [PMID: 27294176 PMCID: PMC4889873 DOI: 10.1016/j.dib.2016.05.024] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2016] [Revised: 05/05/2016] [Accepted: 05/14/2016] [Indexed: 11/18/2022] Open
Abstract
Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. In this article, we provide data that are used for training and testing Mem-ADSVM (Wan et al., 2016. “Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins” [1]), a two-layer multi-label predictor for predicting multi-functional types of membrane proteins.
Affiliation(s)
- Shibiao Wan
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Man-Wai Mak
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Sun-Yuan Kung
  - Department of Electrical Engineering, Princeton University, New Jersey, USA
|
29
|
Wan S, Mak MW, Kung SY. Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins. BMC Bioinformatics 2016; 17:97. [PMID: 26911432 PMCID: PMC4765148 DOI: 10.1186/s12859-016-0940-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2015] [Accepted: 01/27/2016] [Indexed: 11/10/2022] Open
Abstract
Background Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. Results This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, one can determine not only where a protein is located, but also why it resides there. Conclusions Experimental results show that the outputs of both mEN and mLASSO are interpretable and that they perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers’ convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.
Affiliation(s)
- Shibiao Wan: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
- Man-Wai Mak: Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
- Sun-Yuan Kung: Department of Electrical Engineering, Princeton University, New Jersey, USA.
|
30
|
Chen J, Xu H, He PA, Dai Q, Yao Y. A multiple information fusion method for predicting subcellular locations of two different types of bacterial protein simultaneously. Biosystems 2016; 139:37-45. [DOI: 10.1016/j.biosystems.2015.12.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 05/26/2015] [Revised: 10/08/2015] [Accepted: 12/10/2015] [Indexed: 12/14/2022]
|
31
|
Thakur A, Rajput A, Kumar M. MSLVP: prediction of multiple subcellular localization of viral proteins using a support vector machine. Mol Biosyst 2016; 12:2572-86. [DOI: 10.1039/c6mb00241b] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 01/16/2023]
Abstract
Knowledge of the subcellular location (SCL) of viral proteins in the host cell is important for understanding their function in depth.
Affiliation(s)
- Anamika Thakur: Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh-160036, India.
- Akanksha Rajput: Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh-160036, India.
- Manoj Kumar: Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh-160036, India.
|
32
|
Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme. Int J Mach Learn Cyb 2015. [DOI: 10.1007/s13042-015-0460-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 01/11/2023]
|
33
|
Wan S, Mak MW, Kung SY. mLASSO-Hum: A LASSO-based interpretable human-protein subcellular localization predictor. J Theor Biol 2015; 382:223-34. [DOI: 10.1016/j.jtbi.2015.06.042] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 04/24/2015] [Revised: 06/25/2015] [Accepted: 06/26/2015] [Indexed: 02/03/2023]
|
34
|
Liu Z, Hu J. Mislocalization-related disease gene discovery using gene expression based computational protein localization prediction. Methods 2015; 93:119-27. [PMID: 26416496 DOI: 10.1016/j.ymeth.2015.09.022] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 06/02/2015] [Revised: 09/17/2015] [Accepted: 09/21/2015] [Indexed: 01/09/2023]
Abstract
Protein sorting is an important mechanism for transporting proteins to their target subcellular locations after their synthesis. Mutations in genes may disrupt the well-regulated protein sorting process, leading to a variety of mislocation-related diseases. This paper proposes a methodology to discover such disease genes based on gene expression data and computational protein localization prediction. A kernel logistic regression based algorithm is used to identify several candidate cancer genes that may cause cancers through mislocation of their products within the cell. Our results also show that, compared with the gene co-expression network defined on Pearson correlation coefficients, the co-expression network based on the nonlinear maximal information coefficient (MIC) gives better results for subcellular localization prediction.
Affiliation(s)
- Zhonghao Liu: Department of Computer Science & Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, United States.
- Jianjun Hu: Department of Computer Science & Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, United States.
|