1
Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024] Open
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell will give insights into tailored drug design for a disease. As the gold validation standard, the conventional wet lab uses fluorescent microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags for protein subcellular location identification. However, the booming era of proteomics and high-throughput sequencing generates tons of newly discovered proteins, making protein subcellular localization by wet-lab experiments a mission impossible. To tackle this concern, in the past decades, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area. In this article, we review the latest advances in AI-based method development in three typical types of approaches, including sequence-based, knowledge-based, and image-based methods. We also elaborately discuss existing challenges and future directions in AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao: Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou: College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang: Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan: Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
2
Gu X, Ding Y, Xiao P. MLapRVFL: Protein sequence prediction based on Multi-Laplacian Regularized Random Vector Functional Link. Comput Biol Med 2023; 167:107618. [PMID: 37925912 DOI: 10.1016/j.compbiomed.2023.107618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 10/14/2023] [Accepted: 10/23/2023] [Indexed: 11/07/2023]
Abstract
Protein sequence classification is a crucial research field in bioinformatics, playing a vital role in facilitating functional annotation, structure prediction, and gaining a deeper understanding of protein function and interactions. With the rapid development of high-throughput sequencing technologies, a vast amount of unknown protein sequence data is being generated and accumulated, leading to an increasing demand for protein classification and annotation. Existing machine learning methods still have limitations in protein sequence classification, such as low accuracy and precision of classification models, rendering them less valuable in practical applications. Additionally, these models often lack strong generalization capabilities and cannot be widely applied to various types of proteins. Therefore, accurately classifying and predicting proteins remains a challenging task. In this study, we propose a protein sequence classifier called Multi-Laplacian Regularized Random Vector Functional Link (MLapRVFL). By incorporating Multi-Laplacian and L2,1-norm regularization terms into the basic Random Vector Functional Link (RVFL) method, we effectively improve the model's generalization performance, enhance the robustness and accuracy of the classification model. The experimental results on two commonly used datasets demonstrate that MLapRVFL outperforms popular machine learning methods and achieves superior predictive performance compared to previous studies. In conclusion, the proposed MLapRVFL method makes significant contributions to protein sequence prediction.
Affiliation(s)
- Xingyue Gu: State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
- Yijie Ding: Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, 324003, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 611730, China
- Pengfeng Xiao: State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China
3
Sreekumar SP, Palanisamy R, Swaminathan R. Semantic Segmentation of Cell Painted Organelles using DeepLabv3plus Model. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2023; 2023:1-4. [PMID: 38082807 DOI: 10.1109/embc40787.2023.10340728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
Abstract
Cell painting based high-content fluorescence imaging offers deep insight into the functional and biological changes in subcellular structures. However, the need for advanced instrumentation and the limited availability of suitable fluorescent dyes restrict its ability to comprehensively characterize cell morphology. Therefore, generating fluorescence-specific organelle images from transmitted light microscopy provides an alternative solution for clinical applications. In this work, the utility of a semantic segmentation deep network for predicting the endoplasmic reticulum (ER), cytoplasm, and nuclei from a composite image is investigated. For this study, a public dataset of 3456 composite images from the Broad Bioimage Benchmark Collection is considered. Pixel-wise labeling is carried out with the generated binary masks for ER, cytoplasm, and nuclei. The DeepLabv3plus architecture with Atrous Spatial Pyramid Pooling (ASPP) and depthwise separable convolution is used as the learning model to perform semantic segmentation. The accuracy and loss at different learning rates are analyzed, and the segmentation results are validated using the Jaccard index, mean Boundary F (BF) score, and Dice index. The trained model achieved 97.86% accuracy with a loss of 0.07 at a learning rate of 0.01. The mean BF score, Dice index, and Jaccard index for nuclei, ER, and cytoplasm are (0.98, 0.94, 0.88), (0.97, 0.82, 0.7), and (0.95, 0.88, 0.66), respectively. The obtained results indicate that the adopted methodology could delineate the subcellular structures by accurately detecting sharp object boundaries. Therefore, this study could be useful for predicting cell painted images from transmitted light microscopy without the requirement of fluorescent labeling.
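The overlap metrics named in this abstract can be illustrated with a short sketch. This is not the authors' evaluation code: the function name `dice_and_jaccard`, the nested-list mask format, and the convention that two empty masks score 1.0 are assumptions made here for illustration.

```python
def dice_and_jaccard(pred, truth):
    """Return (dice, jaccard) for two equally sized binary masks.

    Masks are nested lists of 0/1 values (a hypothetical format chosen
    for this sketch; real pipelines usually use arrays).
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p and t for p, t in zip(flat_pred, flat_truth))
    pred_area = sum(flat_pred)
    truth_area = sum(flat_truth)
    union = pred_area + truth_area - intersection

    # Assumption: two empty masks are treated as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + truth_area)
    jaccard = intersection / union
    return dice, jaccard


pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
dice, jaccard = dice_and_jaccard(pred_mask, true_mask)
```

The two scores are monotonically related by J = D / (2 - D), so the Jaccard index is never larger than the Dice index for the same pair of masks, which matches the ordering of the values reported above.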
4
Dimitsaki S, Gavriilidis GI, Dimitriadis VK, Natsiavas P. Benchmarking of Machine Learning classifiers on plasma proteomic for COVID-19 severity prediction through interpretable artificial intelligence. Artif Intell Med 2023; 137:102490. [PMID: 36868685 PMCID: PMC9846931 DOI: 10.1016/j.artmed.2023.102490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2022] [Revised: 01/10/2023] [Accepted: 01/11/2023] [Indexed: 01/19/2023]
Abstract
The SARS-CoV-2 pandemic highlighted the need for software tools that could facilitate patient triage regarding potential disease severity or even death. In this article, an ensemble of Machine Learning (ML) algorithms is evaluated for predicting the severity of patients' condition using plasma proteomics and clinical data as input. An overview of AI-based technical developments to support COVID-19 patient management is presented, outlining the landscape of relevant technical developments. Based on this review, an ensemble of ML algorithms that analyze clinical and biological data (i.e., plasma proteomics) of COVID-19 patients is designed and deployed to evaluate the potential use of AI for early COVID-19 patient triage. The proposed pipeline is evaluated using three publicly available datasets for training and testing. Three ML "tasks" are defined, and several algorithms are tested through a hyperparameter tuning method to identify the highest-performance models. As overfitting is one of the typical pitfalls for such approaches (mainly due to the size of the training/validation datasets), a variety of evaluation metrics are used to mitigate this risk. In the evaluation procedure, recall scores ranged from 0.6 to 0.74 and F1-scores from 0.62 to 0.75. The best performance is observed with the Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) algorithms. Additionally, input data (proteomics and clinical data) were ranked based on corresponding Shapley additive explanation (SHAP) values and evaluated for their prognostic capacity and immuno-biological credence. This "interpretable" approach revealed that the ML models could discern critical COVID-19 cases predominantly based on patients' age and plasma proteins related to B cell dysfunction, hyper-activation of inflammatory pathways like Toll-like receptors, and hypo-activation of developmental and immune pathways like SCF/c-Kit signaling.
Finally, the computational workflow presented here is corroborated in an independent dataset, which confirms the superiority of MLP and the implication of the abovementioned predictive biological pathways. Regarding limitations of the presented ML pipeline, the datasets used in this study contain fewer than 1000 observations and a significant number of input features, hence constituting a high-dimensional low-sample (HDLS) dataset which could be sensitive to overfitting. An advantage of the proposed pipeline is that it combines biological data (plasma proteomics) with clinical-phenotypic data. Thus, in principle, the presented approach could enable patient triage in a timely fashion if used on already trained models. However, larger datasets and further systematic validation are needed to confirm the potential clinical value of this approach. The code is available on GitHub: https://github.com/inab-certh/Predicting-COVID-19-severity-through-interpretable-AI-analysis-of-plasma-proteomics.
Affiliation(s)
- Stella Dimitsaki: Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- George I Gavriilidis: Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- Vlasios K Dimitriadis: Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- Pantelis Natsiavas: Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
5
Ahmed S, Rahman A, Hasan MAM, Rahman J, Islam MKB, Ahmad S. predML-Site: Predicting Multiple Lysine PTM Sites With Optimal Feature Representation and Data Imbalance Minimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3624-3634. [PMID: 34546927 DOI: 10.1109/tcbb.2021.3114349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
Identification of post-translational modifications (PTM) is crucial in the study of computational proteomics, cell biology, pathogenesis, and drug development due to its role in many bio-molecular mechanisms. Computational methods for predicting multiple PTMs at the same lysine residue, often referred to as K-PTM, are still evolving. This paper presents a novel computational tool, abbreviated as predML-Site, for predicting K-PTMs, such as acetylation, crotonylation, methylation, and succinylation, from an uncategorized peptide sample involving single, multiple, or no modification. For informative feature representation, multiple sequence encoding schemes, such as sequence-coupling, binary encoding, k-spaced amino acid pairs, and amino acid factors, have been used with ANOVA and incremental feature selection. As the core predictor, a cost-sensitive SVM classifier has been adopted, which effectively mitigates the effect of class-label imbalance in the dataset. predML-Site predicts multi-label PTM sites with 84.18% accuracy using the top 91 features. It has also achieved an 85.34% aiming rate and an 86.58% coverage rate, which are much better than those of the existing state-of-the-art predictors on the same rigorous validation test. This performance indicates that predML-Site can be used as a supportive tool for further K-PTM study. For the convenience of experimental scientists, predML-Site has been deployed as a user-friendly web-server at http://103.99.176.239/predML-Site.
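The abstract reports accuracy, aiming rate, and coverage rate without giving their formulas. The sketch below assumes the set-based definitions commonly used in multi-label predictor papers: aiming averages |true ∩ predicted| / |predicted|, coverage averages |true ∩ predicted| / |true|, and accuracy averages |true ∩ predicted| / |true ∪ predicted|. The function name `multilabel_scores` and the handling of peptides with no modification (empty label sets) are assumptions made here, not details taken from the paper.

```python
def multilabel_scores(true_sets, pred_sets):
    """Return (aiming, coverage, accuracy) averaged over all samples.

    Each element of true_sets / pred_sets is the set of PTM labels for
    one peptide; an empty set means "no modification".
    """
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty lists of equal length")

    aiming = coverage = accuracy = 0.0
    for true, pred in zip(true_sets, pred_sets):
        true, pred = set(true), set(pred)
        if not true and not pred:
            # Assumption: correctly predicting "no modification" is a full hit.
            aiming += 1.0
            coverage += 1.0
            accuracy += 1.0
            continue
        hits = len(true & pred)
        aiming += hits / len(pred) if pred else 0.0
        coverage += hits / len(true) if true else 0.0
        accuracy += hits / len(true | pred)

    n = len(true_sets)
    return aiming / n, coverage / n, accuracy / n


true_labels = [{"acetylation", "methylation"}, {"succinylation"}, set()]
pred_labels = [{"acetylation"}, {"succinylation", "crotonylation"}, set()]
aiming, coverage, accuracy = multilabel_scores(true_labels, pred_labels)
```

Under these definitions, aiming behaves like a per-sample precision and coverage like a per-sample recall, which is why a predictor can score differently on the two.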
6
Wei Z, Liu X, Yan R, Sun G, Yu W, Liu Q, Guo Q. Pixel-level multimodal fusion deep networks for predicting subcellular organelle localization from label-free live-cell imaging. Front Genet 2022; 13:1002327. [PMID: 36386823 PMCID: PMC9644055 DOI: 10.3389/fgene.2022.1002327] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/25/2022] [Accepted: 09/26/2022] [Indexed: 01/25/2023] Open
Abstract
Complex intracellular organizations are commonly represented by dividing the metabolic process of cells into different organelles. Therefore, identifying sub-cellular organelle architecture is significant for understanding intracellular structural properties, specific functions, and biological processes in cells. However, the discrimination of these structures in the natural organizational environment and their functional consequences are not clear. In this article, we propose a new pixel-level multimodal fusion (PLMF) deep network which can be used to predict the location of cellular organelles using label-free cell optical microscopy images followed by deep-learning-based automated image denoising. It provides valuable insights that can be of tremendous help in improving the specificity of label-free cell optical microscopy by using the Transformer-Unet network to predict the ground truth imaging which corresponds to different sub-cellular organelle architectures. The new prediction method proposed in this article combines the advantages of a transformer's global prediction and a CNN's ability to analyze local detail in background features of label-free cell optical microscopy images, so as to improve the prediction accuracy. Our experimental results showed that the PLMF network can achieve a Pearson's correlation coefficient (PCC) of over 0.91 between estimated and true fractions on lung cancer cell-imaging datasets. In addition, we applied the PLMF network method to cell images for label-free prediction of several different subcellular components simultaneously, rather than using several fluorescent labels. These results open up a new way for the time-resolved study of subcellular components in different cells, especially cancer cells.
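For readers unfamiliar with the metric, the following minimal sketch shows how a Pearson's correlation coefficient between predicted and ground-truth pixel intensities could be computed. The function name `pearson` and the flat-list input are illustrative assumptions; the paper's own evaluation code is not shown in the abstract.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return covariance / math.sqrt(var_x * var_y)


# Hypothetical flattened pixel intensities for a predicted and a true image.
predicted = [0.10, 0.40, 0.35, 0.80, 0.95]
observed = [0.12, 0.38, 0.30, 0.85, 0.90]
pcc = pearson(predicted, observed)
```

Because the coefficient is invariant to linear rescaling of either image, it rewards matching spatial patterns rather than matching absolute intensity.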
Affiliation(s)
- Zhihao Wei: Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
- Xi Liu: Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
- Ruiqing Yan: Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
- Guocheng Sun: Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China; School of Mechanical Engineering & Hydrogen Energy Research Centre, Beijing Institute of Petrochemical Technology, Beijing, China
- Weiyong Yu: Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
- Qiang Liu: Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
- Qianjin Guo (corresponding author): Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China; School of Mechanical Engineering & Hydrogen Energy Research Centre, Beijing Institute of Petrochemical Technology, Beijing, China
7
Multiple Parallel Fusion Network for Predicting Protein Subcellular Localization from Stimulated Raman Scattering (SRS) Microscopy Images in Living Cells. Int J Mol Sci 2022; 23:ijms231810827. [PMID: 36142736 PMCID: PMC9504098 DOI: 10.3390/ijms231810827] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/17/2022] [Revised: 09/10/2022] [Accepted: 09/13/2022] [Indexed: 11/23/2022] Open
Abstract
Stimulated Raman Scattering Microscopy (SRS) is a powerful tool for label-free detailed recognition and investigation of the cellular and subcellular structures of living cells. Determining subcellular protein localization from the cell level of SRS images is one of the basic goals of cell biology, which can not only provide useful clues for their functions and biological processes but also help to determine the priority and select the appropriate target for drug development. However, the bottleneck in predicting subcellular protein locations of SRS cell imaging lies in modeling complicated relationships concealed beneath the original cell imaging data owing to the spectral overlap information from different protein molecules. In this work, a multiple parallel fusion network, MPFnetwork, is proposed to study the subcellular locations from SRS images. This model used a multiple parallel fusion model to construct feature representations and combined multiple nonlinear decomposing algorithms as the automated subcellular detection method. Our experimental results showed that the MPFnetwork could achieve over 0.93 dice correlation between estimated and true fractions on SRS lung cancer cell datasets. In addition, we applied the MPFnetwork method to cell images for label-free prediction of several different subcellular components simultaneously, rather than using several fluorescent labels. These results open up a new method for the time-resolved study of subcellular components in different cells, especially cancer cells.
8
Cao X, Xing L, Majd E, He H, Gu J, Zhang X. A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data. Front Genet 2022; 13:836798. [PMID: 35281805 PMCID: PMC8905542 DOI: 10.3389/fgene.2022.836798] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/15/2021] [Accepted: 01/18/2022] [Indexed: 11/13/2022] Open
Abstract
The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. 
The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers.
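Several of the evaluation metrics listed in this abstract follow directly from a binary confusion matrix. The sketch below is only an illustration under that binary assumption; the helper name `classification_metrics` and the example counts are hypothetical and are not taken from the benchmark itself.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1-score and false-positive rate.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives for one binary task.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": false_positive_rate,
    }


# Hypothetical counts for one cell phenotype versus the rest.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

The area under the ROC curve is not covered here because it requires ranked prediction scores rather than a single set of counts.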
Affiliation(s)
- Xiaowen Cao: School of Artificial Intelligence, Hebei University of Technology, Tianjin, China; Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Li Xing: Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
- Elham Majd: Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Hua He: School of Science, Hebei University of Technology, Tianjin, China
- Junhua Gu: School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Xuekui Zhang: Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
9
Pan G, Sun C, Liao Z, Tang J. Machine and Deep Learning for Prediction of Subcellular Localization. Methods Mol Biol 2022; 2361:249-261. [PMID: 34236666 DOI: 10.1007/978-1-0716-1641-3_15] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 02/07/2023]
Abstract
Protein subcellular localization prediction (PSLP), which plays an important role in the field of computational biology, identifies the position and function of proteins in cells without high cost or laborious effort. In the past few decades, various methods with different algorithms have been proposed for solving the problem of subcellular localization prediction; machine learning and deep learning constitute a large portion of those proposed methods. In order to provide an overview of those methods, the first part of this article is a brief review of several state-of-the-art machine learning methods for subcellular localization prediction; then the materials used for subcellular localization prediction are described, and a simple prediction method that takes protein sequences as input and utilizes a convolutional neural network as the classifier is introduced. Finally, a list of notes is provided to indicate the major problems that may occur with this method.
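A convolutional network needs a numeric, fixed-size representation of each protein sequence. The abstract does not say which encoding the chapter uses, so the sketch below assumes plain one-hot encoding over the 20 standard amino acids with zero-padding and truncation to a fixed length; the names `AMINO_ACIDS`, `one_hot_encode`, and `max_len` are illustrative choices, not the chapter's.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Residues outside the 20 standard amino acids
    (for example 'X') are also encoded as all-zero rows.
    """
    encoded = []
    for residue in sequence.upper()[:max_len]:
        row = [0] * len(AMINO_ACIDS)
        position = INDEX.get(residue)
        if position is not None:
            row[position] = 1
        encoded.append(row)

    # Pad so that every sequence yields a matrix of identical shape.
    while len(encoded) < max_len:
        encoded.append([0] * len(AMINO_ACIDS))
    return encoded


matrix = one_hot_encode("MKX", max_len=5)
```

The resulting matrix can be fed to a 1D convolution whose filters slide along the sequence axis, with the 20 residue columns acting as input channels.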
Affiliation(s)
- Gaofeng Pan: Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- Chao Sun: Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- Zijun Liao: Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA; Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fujian, China
- Jijun Tang: Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA; School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
10
Deep Learning-Based Prediction of Throttle Value and State for Wheel Loaders. Energies 2021. [DOI: 10.3390/en14217202] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Accurate prediction of the throttle value and state for wheel loaders can help to achieve autonomous operation, thereby reducing the cost and accident rate. However, existing methods based on a physical model cannot accurately reflect the operator’s driving habits and the interaction between wheel loaders and the environment. In this paper, a deep-learning-based prediction model is developed to predict the throttle value and state for wheel loaders by learning from driving data. Multiple long–short-term memory (LSTM) networks are used to extract the temporal features of different stages during the operation of the wheel loader. Two backward-propagation neural networks (BPNNs), which use the temporal feature extracted by LSTM as the input, are designed to output the final prediction results of throttle value and state, respectively. The proposed prediction model is trained and tested using the data from two different conditions. The end-to-end LSTM prediction model and BPNNs are used as benchmark models. The results indicate that the proposed prediction model has good prediction accuracy and adaptability. Furthermore, the relationship between the prediction performance and signal sampling frequency is also studied. The proposed prediction method that combines driving data and deep learning can make the throttle action conform to the decisions of an experienced operator, providing technical support for the autonomous operation of construction machinery.
11
Basgalupp M, Cerri R, Schietgat L, Triguero I, Vens C. Beyond global and local multi-target learning. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
12
Jiang Y, Wang D, Wang W, Xu D. Computational methods for protein localization prediction. Comput Struct Biotechnol J 2021; 19:5834-5844. [PMID: 34765098 PMCID: PMC8564054 DOI: 10.1016/j.csbj.2021.10.023] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/31/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022] Open
Abstract
The accurate annotation of protein localization is crucial in understanding protein function in tandem with a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally-determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.
Affiliation(s)
- Yuexu Jiang: Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Duolin Wang: Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Weiwei Wang: Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Dong Xu: Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
13
Ahmed S, Rahman A, Hasan MAM, Ahmad S, Shovan SM. Computational identification of multiple lysine PTM sites by analyzing the instance hardness and feature importance. Sci Rep 2021; 11:18882. [PMID: 34556767 PMCID: PMC8460736 DOI: 10.1038/s41598-021-98458-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/29/2021] [Accepted: 09/08/2021] [Indexed: 02/08/2023] Open
Abstract
Identification of post-translational modifications (PTM) is significant in the study of computational proteomics, cell biology, pathogenesis, and drug development due to its role in many bio-molecular mechanisms. Though there are several computational tools to identify individual PTMs, only three predictors have been established to predict multiple PTMs at the same lysine residue. Furthermore, detailed analysis and assessment on dataset balancing and the significance of different feature encoding techniques for a suitable multi-PTM prediction model are still lacking. This study introduces a computational method named 'iMul-kSite' for predicting acetylation, crotonylation, methylation, succinylation, and glutarylation, from an unrecognized peptide sample with one, multiple, or no modifications. After successfully eliminating the redundant data samples from the majority class by analyzing the hardness of the sequence-coupling information, feature representation has been optimized by adopting the combination of ANOVA F-Test and incremental feature selection approach. The proposed predictor predicts multi-label PTM sites with 92.83% accuracy using the top 100 features. It has also achieved a 93.36% aiming rate and 96.23% coverage rate, which are much better than the existing state-of-the-art predictors on the validation test. This performance indicates that 'iMul-kSite' can be used as a supportive tool for further K-PTM study. For the convenience of the experimental scientists, 'iMul-kSite' has been deployed as a user-friendly web-server at http://103.99.176.239/iMul-kSite .
Affiliation(s)
- Sabit Ahmed: Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, 6204 Bangladesh (grid.443086.d, 0000 0004 1755 355X)
- Afrida Rahman: Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, 6204 Bangladesh (grid.443086.d, 0000 0004 1755 355X)
- Md. Al Mehedi Hasan: Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, 6204 Bangladesh (grid.443086.d, 0000 0004 1755 355X)
- Shamim Ahmad: Computer Science and Engineering, University of Rajshahi, Rajshahi, 6205 Bangladesh (grid.412656.2, 0000 0004 0451 7306)
- S. M. Shovan: Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, 6204 Bangladesh (grid.443086.d, 0000 0004 1755 355X)
14
Islam MKB, Rahman J, Hasan MAM, Ahmad S. predForm-Site: Formylation site prediction by incorporating multiple features and resolving data imbalance. Comput Biol Chem 2021; 94:107553. [PMID: 34384997 DOI: 10.1016/j.compbiolchem.2021.107553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2020] [Revised: 06/22/2021] [Accepted: 07/28/2021] [Indexed: 10/20/2022]
Abstract
Formylation is a recently discovered post-translational modification of lysine residues and is responsible for different kinds of diseases. In this work, a novel predictor, named predForm-Site, has been developed to predict formylation sites with higher accuracy. We have integrated multiple sequence features to develop a more informative representation of formylation sites. Moreover, the decision function of the underlying classifier has been optimized on the skewed formylation dataset during model training to improve prediction quality. On the dataset used by the LFPred and Formator predictors, predForm-Site achieved 99.5% sensitivity, 99.8% specificity and 99.8% overall accuracy with an AUC of 0.999 in the jackknife test. In the independent test, it also achieved more than 97% sensitivity and 99% specificity. Similarly, in benchmarking against the recent method CKSAAP_FormSite, the proposed predictor performed significantly better on all measures, with sensitivity higher by around 20%, specificity by nearly 30% and overall accuracy by more than 22%. These experimental results show that predForm-Site can be used as a complementary tool for the fast exploration of formylation sites. For the convenience of the scientific community, predForm-Site has been deployed as an online tool, accessible at http://103.99.176.239:8080/predForm-Site.
Affiliation(s)
- Md Khaled Ben Islam
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Department of Computer Science & Engineering, Pabna University of Science and Technology, Pabna, Bangladesh.
| | - Julia Rahman
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh.
| | - Md Al Mehedi Hasan
- Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
| | - Shamim Ahmad
- Department of Computer Science & Engineering, Rajshahi University, Rajshahi, Bangladesh
|
15
|
Xie X, Yue S, Shi B, Li H, Cui Y, Wang J, Yang P, Li S, Li X, Bian S. Comprehensive Analysis of the SBP Family in Blueberry and Their Regulatory Mechanism Controlling Chlorophyll Accumulation. Frontiers in Plant Science 2021; 12:703994. [PMID: 34276754 PMCID: PMC8281205 DOI: 10.3389/fpls.2021.703994] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/01/2021] [Accepted: 06/09/2021] [Indexed: 06/13/2023]
Abstract
SQUAMOSA Promoter Binding Protein (SBP) family genes act as central players to regulate plant growth and development with functional redundancy and specificity. Addressing the diversity of the SBP family in crops is of great significance to precisely utilize them to improve agronomic traits. Blueberry is an important economic berry crop. However, the SBP family has not been described in blueberry. In the present study, twenty VcSBP genes were identified through data mining against blueberry transcriptome databases. These VcSBPs could be clustered into eight groups, and the gene structures and motif compositions are divergent among the groups and similar within each group. The VcSBPs were differentially expressed in various tissues. Intriguingly, 10 VcSBPs were highly expressed at green fruit stages and dramatically decreased at the onset of fruit ripening, implying that they are important regulators during early fruit development. Computational analysis showed that 10 VcSBPs were targeted by miR156, and four of them were further verified by degradome sequencing. Moreover, their functional diversity was studied in Arabidopsis. Noticeably, three VcSBPs significantly increased chlorophyll accumulation, and qRT-PCR analysis indicated that VcSBP13a in Arabidopsis enhanced the expression of chlorophyll biosynthetic genes such as AtDVR, AtPORA, AtPORB, AtPORC, and AtCAO. Finally, the targets of VcSBPs were computationally identified in blueberry, and the Y1H assay showed that VcSBP13a could physically bind to the promoter region of the chlorophyll-associated gene VcLHCB1. Our findings provided an overall framework for individually understanding the characteristics and functions of the SBP family in blueberry.
Affiliation(s)
- Xin Xie
- College of Plant Science, Jilin University, Changchun, China
| | - Shaokang Yue
- College of Plant Science, Jilin University, Changchun, China
| | - Baosheng Shi
- College of Landscape Architecture and Tourism, Hebei Agricultural University, Baoding, China
| | - Hongxue Li
- College of Plant Science, Jilin University, Changchun, China
| | - Yuhai Cui
- London Research and Development Centre, Agriculture and Agri-Food Canada, London, ON Canada
- Department of Biology, Western University, London, ON, Canada
| | - Jingying Wang
- College of Plant Science, Jilin University, Changchun, China
| | - Pengjie Yang
- College of Plant Science, Jilin University, Changchun, China
| | - Shuchun Li
- Department of Pain, Second Hospital of Jilin University, Changchun, China
| | - Xuyan Li
- College of Plant Science, Jilin University, Changchun, China
| | - Shaomin Bian
- College of Plant Science, Jilin University, Changchun, China
|
16
|
Zhang Q, Zhang Y, Li S, Han Y, Jin S, Gu H, Yu B. Accurate prediction of multi-label protein subcellular localization through multi-view feature learning with RBRL classifier. Brief Bioinform 2021; 22:6127451. [PMID: 33537726 DOI: 10.1093/bib/bbab012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/20/2020] [Revised: 12/12/2020] [Accepted: 01/06/2021] [Indexed: 01/27/2023] Open
Abstract
Multi-label proteins can participate in carrier transportation, enzyme catalysis, hormone regulation and other life activities. They also play a key role in biopharmaceuticals and in gene and cell therapy. This article proposes a prediction method called Mps-mvRBRL to predict the subcellular localization (SCL) of multi-label proteins. First, pseudo position-specific scoring matrix, dipeptide composition, position-specific scoring matrix-transition probability composition, gene ontology and pseudo amino acid composition algorithms are used to obtain numerical information from different views. Based on the contribution of the five individual feature extraction methods, differential evolution is used for the first time to learn the weight of each single feature, and the original features are then fused into multi-view information by a weighted combination. Second, a weighted linear discriminant analysis framework based on a binary weight form is applied to the fused high-dimensional features to eliminate irrelevant information. Finally, the best feature vector is input into the joint ranking support vector machine and binary relevance with robust low-rank learning classifier to predict the SCL. Under leave-one-out cross-validation, the overall actual accuracy (OAA) and overall location accuracy (OLA) of Mps-mvRBRL on the training set of Gram-positive bacteria are both 99.81%. The OAA values on the test sets of the plant, virus and Gram-negative bacteria datasets are 97.24%, 98.55% and 98.20%, respectively, and the OLA values are 97.16%, 97.62% and 98.28%, respectively. The results show that the model achieves good performance in predicting the SCL of multi-label proteins.
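As a reading aid for the OAA and OLA figures above, here is a minimal sketch of how the two measures are usually computed in multi-label SCL work: OAA counts a protein as correct only when its whole set of locations is predicted exactly, while OLA counts correctly predicted locations over all true locations. The abstract does not give its formulas, so these definitions and the function names are assumptions.

```python
from typing import List, Set


def overall_actual_accuracy(true_sets: List[Set[str]],
                            pred_sets: List[Set[str]]) -> float:
    """Fraction of proteins whose predicted location set equals the true set."""
    exact = sum(1 for y, p in zip(true_sets, pred_sets) if y == p)
    return exact / len(true_sets)


def overall_locative_accuracy(true_sets: List[Set[str]],
                              pred_sets: List[Set[str]]) -> float:
    """Correctly predicted locations divided by the total number of true locations."""
    hits = sum(len(y & p) for y, p in zip(true_sets, pred_sets))
    total = sum(len(y) for y in true_sets)
    return hits / total


# Toy example with hypothetical location labels.
truth = [{"cytoplasm", "nucleus"}, {"membrane"}, {"cytoplasm"}]
preds = [{"cytoplasm"}, {"membrane"}, {"cytoplasm", "nucleus"}]
print(overall_actual_accuracy(truth, preds))    # one of three proteins matches exactly
print(overall_locative_accuracy(truth, preds))  # three of four true locations recovered
```

OAA is the stricter of the two, because both under-prediction and over-prediction of locations count as errors.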
Affiliation(s)
- Qi Zhang
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Yandan Zhang
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Shan Li
- School of Mathematics and Statistics, Central South University, China
| | - Yu Han
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Shuping Jin
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Haiming Gu
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
|
17
|
Protein Subnuclear Localization Based on Radius-SMOTE and Kernel Linear Discriminant Analysis Combined with Random Forest. Electronics 2020. [DOI: 10.3390/electronics9101566] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022]
Abstract
Protein subnuclear localization plays an important role in proteomics and can help researchers understand the biological functions of the nucleus. To date, most protein datasets used in such studies are unbalanced, which reduces the prediction accuracy of protein subnuclear localization, especially for the minority classes. In this work, a novel method is therefore proposed to predict protein subnuclear localization on unbalanced datasets. First, the position-specific score matrix is used to extract the feature vectors of two benchmark datasets, and the useful features are then selected by kernel linear discriminant analysis. Second, Radius-SMOTE is used to expand the samples of the minority classes to deal with the imbalance in the datasets. Finally, the optimal feature vectors of the expanded datasets are classified by random forest. To evaluate the performance of the proposed method, four evaluation indices are calculated with the jackknife test. The results indicate that the proposed method performs better than other conventional methods and effectively improves the accuracy for both majority and minority classes.
|
18
|
Learning Distance Metric for Support Vector Machine: A Multiple Kernel Learning Approach. Neural Process Lett 2019. [DOI: 10.1007/s11063-019-10053-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
19
|
Ruan X, Zhou D, Nie R, Hou R, Cao Z. Prediction of apoptosis protein subcellular location based on position-specific scoring matrix and isometric mapping algorithm. Med Biol Eng Comput 2019; 57:2553-2565. [PMID: 31621050 DOI: 10.1007/s11517-019-02045-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/29/2018] [Accepted: 09/04/2019] [Indexed: 01/04/2023]
Abstract
Apoptosis proteins are related to many diseases. Obtaining the subcellular localization information of apoptosis proteins helps to understand disease mechanisms and to develop new drugs. At present, researchers mainly focus on the primary protein sequences, so there is still room for improvement in the prediction accuracy of the subcellular localization of apoptosis proteins. In this paper, a new method named ERT-ECT-PSSM-IS is proposed to predict the subcellular localization of apoptosis proteins based on the position-specific scoring matrix (PSSM). First, the local and global features of different directions are extracted by evolutionary row transformation (ERT) and cross-covariance of evolutionary column transformation (ECT) based on PSSM (ERT-ECT-PSSM). Second, an improved isometric mapping algorithm (I-SMA) is used to eliminate redundant features. Finally, a support vector machine (SVM) is used for classification, and the prediction accuracy is evaluated by jackknife cross-validation tests. The experimental results show that the proposed method not only extracts richer feature representations but also has better predictive performance and robustness for the subcellular localization of apoptosis proteins on the ZD98, ZW225, and CL317 datasets. Graphical abstract: Framework of the proposed prediction model.
Affiliation(s)
- Xiaoli Ruan
- Information College, Yunnan University, Kunming, 650504, China
| | - Dongming Zhou
- Information College, Yunnan University, Kunming, 650504, China.
| | - Rencan Nie
- Information College, Yunnan University, Kunming, 650504, China
| | - Ruichao Hou
- Information College, Yunnan University, Kunming, 650504, China
| | - Zicheng Cao
- School of Public Health, Sun Yat-sen University, Shenzhen, 510080, China
|
20
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: Apoptosis is associated with a number of human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage. Information on the subcellular localization of apoptosis proteins is very important for understanding the mechanism of programmed cell death and for drug development. However, predicting the subcellular localization of apoptosis proteins remains a challenging task. RESULTS: In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. First, features are extracted from the protein sequences by combining the pseudo-position specific scoring matrix (PsePSSM) and the detrended cross-correlation analysis coefficient (DCCA coefficient); the dimensionality of the extracted feature information is then reduced by local Fisher discriminant analysis (LFDA). Finally, the optimal feature vectors are input to the SVM classifier to predict the subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7%, 99.6% and 100% are achieved on the three benchmark datasets, respectively, by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION: The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, and it is accurate enough to be a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
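The PsePSSM step can be illustrated with the widely used formulation of the descriptor: the column means of a normalized PSSM, followed by the mean squared difference between rows that are a fixed lag apart, for each lag up to a maximum. The sketch below follows that common definition; it is not taken from the authors' repository, and the function name `pse_pssm`, the parameter `max_lag`, and the assumption that the input matrix is already normalized are illustrative.

```python
from typing import List


def pse_pssm(pssm: List[List[float]], max_lag: int) -> List[float]:
    """PsePSSM-style descriptor for one protein.

    pssm    : L rows (residues) by C columns (normally C = 20), assumed normalized.
    max_lag : largest sequence-order lag; must be smaller than L.
    Returns C column means followed by C correlation terms per lag,
    i.e. a vector of length C * (1 + max_lag).
    """
    length = len(pssm)
    if not 0 <= max_lag < length:
        raise ValueError("max_lag must satisfy 0 <= max_lag < number of residues")
    n_cols = len(pssm[0])

    # Part 1: average score of each amino-acid column over the whole sequence.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]

    # Part 2: for each lag, mean squared difference between residues i and i + lag.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum((pssm[i][j] - pssm[i + lag][j]) ** 2
                        for i in range(length - lag))
            features.append(total / (length - lag))
    return features


# Tiny two-column toy matrix (a real PSSM has 20 columns).
toy = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
print(pse_pssm(toy, max_lag=2))
```

The descriptor has the same length for every protein regardless of sequence length, which is what allows it to be fed to a fixed-input classifier such as an SVM.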
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
| | - Shan Li
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Wenying Qiu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Junwei Du
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yusen Zhang
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
|
21
|
Abstract
The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods used for the automated GO-CC annotation of proteins suffer from the inconsistency of individual GO-CC term predictions. Here, we present FGGA-CC+, a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein coding genes at the subcellular compartment or macromolecular complex levels. Aiming to boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations. As a result, FGGA-CC+ classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC+ classifiers are fully interpretable and their predictions amenable to expert analysis. Promising results on protein annotation data from five model organisms were obtained. Additionally, successful validation results were obtained in the annotation of a challenging subset of tandem duplicated genes in tomato, a non-model organism. Overall, these results suggest that FGGA-CC+ classifiers can indeed be useful for satisfying the huge demand for GO-CC annotation arising from ubiquitous high-throughput sequencing and proteomic projects.
|
22
|
Wang S, Yue Y. Protein subnuclear localization based on a new effective representation and intelligent kernel linear discriminant analysis by dichotomous greedy genetic algorithm. PLoS One 2018; 13:e0195636. [PMID: 29649330 PMCID: PMC5896989 DOI: 10.1371/journal.pone.0195636] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 12/15/2017] [Accepted: 03/26/2018] [Indexed: 01/03/2023] Open
Abstract
A wide variety of methods have been proposed to improve the prediction accuracy of protein subnuclear localization. However, an important trend among these methods is to fuse multiple feature representations, and the fusion process takes a lot of time. In view of this, this paper proposes a method that combines a new single feature representation with a new algorithm to obtain a good recognition rate. Specifically, based on the position-specific scoring matrix (PSSM), we propose a new expression, the correlation position-specific scoring matrix (CoPSSM), as the protein feature representation. Based on the classic nonlinear dimension reduction algorithm, kernel linear discriminant analysis (KLDA), we add a new discriminant criterion and propose a dichotomous greedy genetic algorithm (DGGA) to intelligently select its kernel bandwidth parameter. Two public datasets were used for the numerical experiments, with the jackknife test and a KNN classifier. The results show that the overall success rate (OSR) with the single representation CoPSSM is higher than that with many relevant representations. The OSR of the proposed method reaches 87.444% and 90.3361% on these two datasets, respectively, outperforming many current methods. To show the generalization of the proposed algorithm, two additional standard protein subcellular datasets were chosen for an extended experiment, and the prediction accuracy under the jackknife test and the independent test is still considerable.
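The evaluation protocol named above, a jackknife test with a KNN classifier, can be sketched as follows: each protein is held out in turn, the classifier is built on all remaining proteins, and the overall success rate is the fraction of held-out proteins labelled correctly. This is a generic illustration on toy two-dimensional points; the CoPSSM features, the KLDA step and the value of k used in the paper are not reproduced, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(train_x: List[Sequence[float]], train_y: List[str],
                query: Sequence[float], k: int = 1) -> str:
    """Label a query point by majority vote among its k nearest training points."""
    ranked = sorted((math.dist(x, query), label)
                    for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def jackknife_success_rate(xs: List[Tuple[float, ...]], ys: List[str],
                           k: int = 1) -> float:
    """Leave-one-out (jackknife) overall success rate of a KNN classifier."""
    correct = 0
    for i in range(len(xs)):
        # Hold out sample i and use every other sample as training data.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy feature vectors with hypothetical class labels "a" and "b".
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
print(jackknife_success_rate(points, labels, k=1))
```

Because every sample is tested exactly once and nothing is drawn at random, the jackknife result for a given dataset and classifier is deterministic, which is why it is often described as the most rigorous of the common cross-validation schemes.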
Affiliation(s)
- Shunfang Wang
- School of Information Science and Engineering, Yunnan University, Kunming, PR China
| | - Yaoting Yue
- School of Information Science and Engineering, Yunnan University, Kunming, PR China
|
23
|
Wang M, Wang T, Li A. ksrMKL: a novel method for identification of kinase-substrate relationships using multiple kernel learning. PeerJ 2017; 5:e4182. [PMID: 29340231 PMCID: PMC5741978 DOI: 10.7717/peerj.4182] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 07/13/2017] [Accepted: 12/01/2017] [Indexed: 01/24/2023] Open
Abstract
Phosphorylation, which is catalyzed by protein kinases, plays a crucial role in many cellular processes and is closely related to many diseases. Identification of kinase-substrate relationships is important for understanding phosphorylation and provides a fundamental basis for further disease-related research and drug design. In this study, we develop a novel computational method, ksrMKL, to identify kinase-substrate relationships based on multiple kernel learning. The comparative analysis is based on 10-fold cross-validation and a dataset collected from the Phospho.ELM database. The results show that ksrMKL improves greatly on the single-kernel support vector machine in various measures. Furthermore, with an independent test dataset extracted from the PhosphoSitePlus database, we compare ksrMKL with two existing kinase-substrate relationship prediction tools, namely iGPS and PKIS. The experimental results show that ksrMKL has better prediction performance than these existing tools.
Affiliation(s)
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, China
| | - Tao Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, China
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, China
|
24
|
Wang S, Nie B, Yue K, Fei Y, Li W, Xu D. Protein Subcellular Localization with Gaussian Kernel Discriminant Analysis and Its Kernel Parameter Selection. Int J Mol Sci 2017; 18:E2718. [PMID: 29244758 PMCID: PMC5751319 DOI: 10.3390/ijms18122718] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/16/2017] [Revised: 12/04/2017] [Accepted: 12/05/2017] [Indexed: 11/16/2022] Open
Abstract
Kernel discriminant analysis (KDA) is a dimension reduction and classification algorithm based on the nonlinear kernel trick, which can be used in a novel way to process high-dimensional and complex biological data before classification tasks such as protein subcellular localization. Kernel parameters have a great impact on the performance of the KDA model. Specifically, for KDA with the popular Gaussian kernel, selecting the scale parameter remains a challenging problem. Thus, this paper introduces the KDA method and proposes a new method for Gaussian kernel parameter selection, based on the principle that the differences between the reconstruction errors of edge normal samples and those of interior normal samples should be maximized for suitable kernel parameters. Experiments with various standard datasets of protein subcellular localization show that the overall accuracy of protein classification prediction with KDA is much higher than that without KDA. Meanwhile, the kernel parameter of KDA has a great impact on the efficiency, and the proposed method can produce an optimum parameter, which allows the new algorithm not only to perform as effectively as the traditional ones, but also to reduce the computational time and thus improve efficiency.
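To make concrete why the Gaussian scale parameter matters, the sketch below builds a Gaussian kernel matrix under one common parameterization, k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), and prints it for several values of sigma. The parameterization, the function name and the toy points are assumptions; the reconstruction-error criterion that the paper proposes for choosing the parameter is not implemented here.

```python
import math
from typing import List, Sequence


def gaussian_kernel_matrix(samples: List[Sequence[float]],
                           sigma: float) -> List[List[float]]:
    """Pairwise Gaussian kernel values exp(-||x - y||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    gamma = 1.0 / (2.0 * sigma ** 2)
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
         for y in samples]
        for x in samples
    ]


# Two toy points one unit apart.
pts = [(0.0, 0.0), (1.0, 0.0)]
for sigma in (0.1, 1.0, 100.0):
    k = gaussian_kernel_matrix(pts, sigma)
    # Off-diagonal entry: close to 0 for small sigma, close to 1 for large sigma.
    print(sigma, round(k[0][1], 6))
```

With a very small sigma the kernel matrix approaches the identity, so every sample looks unrelated to every other; with a very large sigma it approaches a matrix of ones, so all samples look alike. Both extremes discard class structure, which is why a principled way of selecting the parameter is needed.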
Affiliation(s)
- Shunfang Wang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China.
| | - Bing Nie
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China.
| | - Kun Yue
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China.
| | - Yu Fei
- School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming 650221, China.
| | - Wenjia Li
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China.
| | - Dongshu Xu
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China.
|
25
|
Hasan MAM, Ahmad S, Molla MKI. iMulti-HumPhos: a multi-label classifier for identifying human phosphorylated proteins using multiple kernel learning based support vector machines. Molecular BioSystems 2017; 13:1608-1618. [DOI: 10.1039/c7mb00180k] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 12/22/2022]
Abstract
An efficient multi-label classifier for identifying human phosphorylated proteins has been developed by introducing multiple kernel learning based support vector machines.
Affiliation(s)
- Md. Al Mehedi Hasan
- Department of Computer Science & Engineering, University of Rajshahi, Rajshahi 6205, Bangladesh
| | - Shamim Ahmad
- Department of Computer Science & Engineering, University of Rajshahi, Rajshahi 6205, Bangladesh
|